Video concept detection using multi-layer multi-instance learning

ABSTRACT

Visual concepts contained within a video clip are classified based upon a set of target concepts. The clip is segmented into shots and a multi-layer multi-instance (MLMI) structured metadata representation of each shot is constructed. A set of pre-generated trained models of the target concepts is validated using a set of training shots. An MLMI kernel is recursively generated which models the MLMI structured metadata representation of each shot by comparing prescribed pairs of shots. The MLMI kernel is subsequently utilized to generate a learned objective decision function which learns a classifier for determining if a particular shot (that is not in the set of training shots) contains instances of the target concepts. A regularization framework can also be utilized in conjunction with the MLMI kernel to generate modified learned objective decision functions. The regularization framework introduces explicit constraints which serve to maximize the precision of the classifier.

BACKGROUND

Due to rapid advances in video capture technology, the cost of video capture devices has dropped greatly in recent years. As a result, video capture devices have surged in availability and popularity. Video capture functionality is now available to consumers on a mass market level in a variety of different forms such as mobile phones, digital cameras, digital camcorders, web cameras and the like. Additionally, laptop computers are also now available with integrated web cameras. As a result, the quantity of digital video data being captured has recently surged to an unprecedented level. Furthermore, corollary advances in data storage, compression and network communication technologies have made it cost effective for mass market consumers to store and communicate this video data to others. A wide variety of mass market software applications and other tools also now exist which provide consumers with the ability to view, manipulate and further share this video data for a variety of different purposes.

SUMMARY

This Summary is provided to introduce a selection of concepts, in a simplified form, that are further described hereafter in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.

Video concept detection (VCD) technique embodiments described herein are generally applicable to classifying visual concepts contained within a video clip based upon a prescribed set of target concepts, each concept corresponding to a particular semantic idea of interest. These technique embodiments and the classification resulting therefrom can be used to increase the speed and effectiveness by which a video clip can be browsed and searched for particular concepts of interest. In one exemplary embodiment, a video clip is segmented into shots and a multi-layer multi-instance (MLMI) structured metadata representation of each shot is constructed. A set of pre-generated trained models of the target concepts is validated using a set of training shots. An MLMI kernel is recursively generated which models the MLMI structured metadata representation of each shot by comparing prescribed pairs of shots. The MLMI kernel can subsequently be utilized to generate a learned objective decision function which learns a classifier for determining if a particular shot (that is not in the set of training shots) contains instances of the target concepts. In other exemplary embodiments, a regularization framework can be utilized in conjunction with the MLMI kernel to generate modified learned objective decision functions. The regularization framework introduces explicit constraints which serve to maximize the precision of the classifier.

In addition to the just described benefits, other advantages of the VCD technique embodiments described herein will become apparent from the detailed description which follows hereafter when taken in conjunction with the drawing figures which accompany the detailed description.

DESCRIPTION OF THE DRAWINGS

The specific features, aspects, and advantages of the video concept detection technique embodiments described herein will become better understood with regard to the following description, appended claims, and accompanying drawings where:

FIG. 1 illustrates a diagram of an exemplary embodiment, in simplified form, of a multi-layer multi-instance (MLMI) framework representation of a video clip.

FIGS. 2A, 2B and 2C illustrate diagrams of an exemplary embodiment, in simplified form, of an expanded node pattern set and associated node pattern groups for the MLMI framework.

FIG. 3 illustrates an exemplary embodiment of a process for recursively generating an MLMI kernel which models the MLMI framework by comparing pairs of shots in a video clip.

FIG. 4 illustrates a diagram of exemplary embodiments, in simplified form, of a regularization framework for MLMI learning/modeling which introduces explicit constraints into a learned objective decision function which serve to restrict instance classification in sub-layers of the MLMI framework.

FIG. 5 illustrates an exemplary embodiment, in simplified form, of one process for classifying visual concepts contained within a video clip.

FIG. 6 illustrates an exemplary embodiment, in simplified form, of another process for classifying visual concepts contained within a video clip.

FIG. 7 illustrates a diagram of an exemplary embodiment, in simplified form, of a general purpose, network-based computing device which constitutes an exemplary system for implementing the video concept detection (VCD) technique embodiments described herein.

DETAILED DESCRIPTION

In the following description of video concept detection technique embodiments reference is made to the accompanying drawings which form a part hereof, and in which are shown, by way of illustration, specific exemplary embodiments in which the VCD technique can be practiced. It is understood that other embodiments can be utilized and structural changes can be made without departing from the scope of the technique embodiments.

1.0 Introduction To VCD

As is appreciated by those of skill in the art of video/film digital processing, video annotation generally refers to a method for annotating a video clip with metadata that identifies one or more particular attributes of the clip. Video concept detection (VCD) is one possible method for performing video annotation based on a finite set of particular visual semantic concepts of interest (hereafter referred to simply as “target concepts”). Generally speaking, the VCD technique embodiments described herein classify the visual concepts contained within a video clip based upon a prescribed set of target concepts, and then generate structured metadata that efficiently describes the concepts contained within the clip at both the semantic and syntactic levels. These technique embodiments and the classification resulting therefrom are advantageous in that they can be used to increase the speed and effectiveness by which a video clip can be browsed and searched for target concepts. This is especially advantageous considering the aforementioned surge in the quantity of digital video data that is being captured, stored and communicated, and the large volume of data associated with a typical set of digital video data (herein referred to as a “video clip” or simply a “clip”).

A fundamental step in performing the aforementioned classification of a video clip is to first understand the semantics of the data for the clip. This step can be characterized as a learning or modeling process. It is noted that a semantic gap generally exists between the high-level semantics of a particular video clip and the low-level features contained therein. The VCD technique embodiments described herein are employed as a way to narrow this semantic gap. As such, these VCD technique embodiments serve an important role in the aforementioned learning/modeling process, and therefore also serve an important role towards achieving an understanding of the semantics of a video clip. The aforementioned structured metadata generated from these VCD technique embodiments can be used as the basis for creating a new generation of mass market software applications, tools and systems for quickly and effectively browsing video clips, searching the clips for target concepts, manipulating the clips, and communicating the clips to others.

2.0 MLMI Framework

Generally speaking, a video clip, which can include a plurality of different scenes along with one or more moving objects within each scene, has distinctive structure characteristics compared to a single image of a single scene. More particularly, as will be described in detail hereafter, a clip intrinsically contains hierarchical multi-layer metadata structures and multi-instance data relationships. Accordingly, the VCD technique embodiments described herein are based on a structure-based paradigm for representing a clip as hierarchically structured metadata. As such, these technique embodiments are advantageous since they avail themselves of the hierarchical multi-layer metadata structures and multi-instance data relationships contained within a clip. More particularly, as noted heretofore and as will be described in more detail hereafter, these VCD technique embodiments generally involve classifying the visual concepts contained within a video clip and generating hierarchically structured metadata therefrom, where data relationships inside the clip are modeled using a hierarchical multi-layer multi-instance (MLMI) learning/modeling (hereafter also referred to simply as modeling) framework. As will be described in more detail hereafter, this MLMI framework employs a root layer and a hierarchy of sub-layers which are rooted to the root layer.

Before the VCD technique embodiments described herein are applied to a particular video clip, it is assumed that an MLMI structured metadata representation of the clip has been constructed as follows. First, it is assumed that a conventional method such as a pixel value change-based detection method has been used to perform shot boundary detection on the clip such that the clip is segmented into a plurality of shots, thus constructing what will hereafter be termed a “shot layer.” Each shot (hereafter also referred to as a rooted tree or a shot T) contains a series of consecutive frames in the video that represent a distinctive coherent visual theme. As such, a video clip typically contains a plurality of different shots. Second, it is assumed that a conventional method such as a TRECVID (Text REtrieval Conference (TREC) Video Retrieval Evaluation) organizer has been used to extract one or more key-frames from each shot, thus constructing what will hereafter be termed a “key-frame sub-layer.” Each key-frame contains one or more target concepts. Third, it is assumed that a conventional method such as a J-value Segmentation (JSEG) method has been used to segment each key-frame into a plurality of key-regions and subsequently filter out those key-regions that are smaller than a prescribed size, thus constructing what will hereafter be termed a “key-region sub-layer.” Finally, as will be described in more detail hereafter, it is assumed that a plurality of low-level feature descriptors have been prescribed to describe the visual concepts contained in the shot layer, key-frame sub-layer and key-region sub-layer, and that these various prescribed features have been extracted from the shot layer, key-frame sub-layer and key-region sub-layer. Exemplary low-level feature descriptors for these different layers will be provided hereafter.

Additionally, before the VCD technique embodiments described herein are applied to a particular video clip, it is also assumed that the following procedures have been performed. First, it is assumed that a conventional method such as a Large Scale Concept Ontology for Multimedia (LSCOM) method has been used to pre-define a set of prescribed target concepts. Second, it is assumed that a conventional method such as a Support Vector Machine (SVM) method has been used to pre-generate a set of statistical trained models of this set of target concepts. Third, it is assumed that the trained models are validated using a set of training shots selected from the aforementioned plurality of shots.

As will now be described in detail, each shot can generally be conceptually regarded as a “bag” which contains a plurality of paratactic “concept instances” (hereafter referred to simply as “instances”). More particularly, each key-frame can be conceptually regarded as a bag and each region contained therein can be regarded as an instance. Each key-frame can thus be conceptually regarded as a bag of region instances. Instance classification labels are applied to each key-frame bag as follows. A particular key-frame bag would be labeled as positive if at least one of the region instances within the bag falls within the target concepts; the key-frame bag would be labeled as negative if none of the region instances within the bag falls within the target concepts. As will become clear from the MLMI framework description which follows, this bag-instance correspondence can be further extended into the MLMI framework such that each shot can be conceptually regarded as a “hyper-bag” (herein also referred to as a “shot hyper-bag” for clarity) and each key-frame contained therein can also be regarded as an instance. In this case, instance classification labels are also applied to each shot hyper-bag as follows. A particular shot hyper-bag would be labeled as positive if one or more of the key-frame instances within the hyper-bag falls within the target concepts; the shot hyper-bag would be labeled as negative if none of the key-frame instances within the hyper-bag fall within the target concepts. Each shot can thus be conceptually regarded as a hyper-bag of key-frame instances. This hyper-bag, bag, instance and bag-instance terminology will be further clarified in the description which follows.
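The labeling rule just described can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `label_bag` and `label_hyper_bag`, and the encoding of each key-region as a boolean (True when the region falls within the target concept), are assumptions made for the example and do not appear in the description.

```python
from typing import List


def label_bag(region_labels: List[bool]) -> bool:
    """A key-frame bag is positive if at least one region instance is positive."""
    return any(region_labels)


def label_hyper_bag(key_frames: List[List[bool]]) -> bool:
    """A shot hyper-bag is positive if at least one key-frame instance is positive."""
    return any(label_bag(regions) for regions in key_frames)


# Hypothetical shots: each inner list holds the region labels of one key-frame.
shot_with_concept = [[False, True], [False, False]]
shot_without_concept = [[False, False, False]]

print(label_hyper_bag(shot_with_concept))     # True
print(label_hyper_bag(shot_without_concept))  # False
```

Note that a positive label propagates upward from a single positive region, whereas a negative label requires every region in every key-frame of the shot to be negative.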

FIG. 1 illustrates a diagram of an exemplary embodiment, in simplified form, of an MLMI framework representation of a video clip. As depicted in FIG. 1, a video clip can be represented as a hierarchical three-layer structure. A layer indicator l denotes each particular layer, and each successive layer down the hierarchy describes the visual concepts contained in the clip in a higher degree of granularity compared to the description contained within the layer which precedes it in the hierarchy. As described heretofore, a video clip typically contains a plurality of different shots. The uppermost layer of the MLMI framework (i.e. the root layer) is termed a shot layer 100 which represents the shots 112/128 in the video clip. The shot layer 100 is denoted by l=1. The intermediate sub-layer contiguously beneath the shot layer 100 is termed a key-frame sub-layer 102 which represents the key-frames 104/106 in each shot 112/128, where as described heretofore, each key-frame contains one or more target concepts. More particularly, key-frame 104 is part of a first shot 112 in the clip in which an airplane 110 is taking off in the sky 114 over a city 116, where the airplane is a target concept. Key-frame 106 is part of a second shot 128 in the clip in which two fish 118/120 and a jellyfish 122/124 are swimming in water 126, where the jellyfish is a target concept. The key-frame sub-layer 102 is denoted by l=2. The lowermost sub-layer contiguously beneath the key-frame sub-layer 102 is termed a key-region sub-layer 108 which represents one or more key-regions 110/114/122/124/126 within each key-frame 104/106 that have been filtered as described heretofore, where each filtered key-region contains a particular target concept. More particularly, key-regions 110 and 114 represent the aforementioned airplane within key-frame 104, and key-regions 122, 124, and 126 represent the aforementioned jellyfish in key-frame 106. The key-region sub-layer 108 is denoted by l=3.

Referring again to FIG. 1, the bag-instance correspondences depicted therein will now be considered. Key-frames 104 and 106 can be conceptually regarded as bags, where each bag contains a plurality of region instances contained therein. If an airplane is considered to be a target concept, key-frame bag 104 would be labeled as positive since it contains two key-region instances 110 and 114 that fall within the target concept of an airplane; key-frame bag 106 would accordingly be labeled as negative since it contains no key-region instances that fall within the target concept of an airplane. On the other hand, if a jellyfish is considered to be a target concept, key-frame bag 106 would be labeled as positive since it contains three key-region instances 122, 124 and 126 that fall within the target concept of a jellyfish; key-frame bag 104 would accordingly be labeled as negative since it contains no key-region instances that fall within the target concept of a jellyfish. From a higher layer perspective, shots 112/128 can be conceptually regarded as hyper-bags, where each hyper-bag contains key-frame instances contained therein. If an airplane is again considered to be a target concept, hyper-bag 112 would be labeled as positive since it contains a plurality of key-frame instances (e.g., 104) which fall within the target concept of an airplane; hyper-bag 128 would accordingly be labeled as negative since it contains no key-frame instances which fall within the target concept of an airplane. On the other hand, if a jellyfish is again considered to be a target concept, hyper-bag 128 would be labeled as positive since it contains a plurality of key-frame instances (e.g., 106) which fall within the target concept of a jellyfish; hyper-bag 112 would accordingly be labeled as negative since it contains no key-frame instances which fall within the target concept of a jellyfish.

To summarize, in the MLMI framework described herein, a video clip is represented as a hierarchical “set-based” structure with bag-instance correspondence. Each successive layer down the hierarchy describes the visual concepts contained in the clip in a higher degree of granularity compared to the description contained within the layer which precedes it in the hierarchy. As will be described in more detail hereafter, these visual concept descriptions employed in the different layers can also include different modalities. As described heretofore, various low-level feature descriptors can be prescribed to describe the visual concepts contained within the different layers. Referring again to FIG. 1, by way of example, low-level feature descriptors such as camera motion, object motion within a scene, text and the like can be extracted from the shot layer 100. By way of further example, low-level feature descriptors such as color histogram, color moment, texture and the like can be extracted from the key-frame sub-layer 102. By way of yet further example, low-level feature descriptors such as object shape, object size, object color and the like can be extracted from the key-region sub-layer 108. Furthermore, in the MLMI framework bag-instance correspondences exist both within each individual layer 100/102/108 as well as between contiguous layers in the hierarchy (i.e. between 100 and 102, and between 102 and 108). Additionally, in the MLMI framework a key-frame 104 and 106 can be conceptually regarded as both a bag of key-region instances 110/114 and 122/124/126, as well as an instance within a shot hyper-bag 112 and 128.

3.0 Introduction To Kernel-Based Modeling

This section provides a brief, general introduction to kernel-based modeling methods. In general, a kernel k which models metadata structures within an input space X can be simplistically given by the equation $k: X \times X \to \mathbb{R}$, where the input space X is mapped to either an n-dimensional vector $\mathbb{R}^n$ or any other compound structure. For $x, y \in X$ where x and y represent two different metadata structures within X, a kernel k(x,y) which models X, and compares x and y in order to determine a degree of similarity (or difference) between x and y, can be given by $k(x,y) = \langle \phi(x), \phi(y) \rangle$, where φ is a mapping from the input space X to a high-dimensional (most likely infinite) space Φ embedded with an inner product. In a general sense, kernel k(x,y) also induces the following distance measure between x and y:

$$d(\phi(x), \phi(y)) = \sqrt{k(x,x) - 2k(x,y) + k(y,y)}, \qquad (1)$$

where $d(\phi(x), \phi(y))$ represents a distance function in mapping space Φ.
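As a quick check of equation (1), the following Python sketch evaluates the kernel-induced distance using only kernel evaluations, without ever forming φ explicitly. The function names `kernel_distance` and `linear_kernel` are assumptions made for the example. A linear kernel is used because its feature map is the identity, so the result should coincide with the ordinary Euclidean distance.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def linear_kernel(x: Vector, y: Vector) -> float:
    """Inner product in the input space; here phi is the identity map."""
    return sum(a * b for a, b in zip(x, y))


def kernel_distance(k: Callable[[Vector, Vector], float], x: Vector, y: Vector) -> float:
    """Distance in the mapping space, per equation (1)."""
    squared = k(x, x) - 2.0 * k(x, y) + k(y, y)
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(squared, 0.0))


print(kernel_distance(linear_kernel, [0.0, 0.0], [3.0, 4.0]))  # 5.0
```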

4.0 MLMI Kernel

Generally speaking, this section provides a description of an exemplary embodiment of an MLMI kernel which models the MLMI framework representation of a video clip described heretofore by comparing pairs of shots in a video clip. More particularly, referring again to FIG. 1, it will be appreciated that the MLMI kernel described hereafter facilitates performing VCD on a video clip by modeling the MLMI framework representation of a video clip by comparing prescribed pairs of shots 112 and 128 (hereafter referred to as shots T and T′) within the clip in order to determine a degree of similarity (or difference) there-between. As will be appreciated in the description which follows, this MLMI kernel enables structured metadata to be contained within a linearly separable space without requiring that an explicit feature mapping operation be performed. Additionally, it will be appreciated that this MLMI kernel fuses the rich context information from the different layers 100/102/108 in the hierarchy (i.e. for each layer, information describing its particular level in the overall hierarchy, information describing the low-level features contained therein, and information describing any sub-layer(s) that is linked thereto), thereby improving the efficiency and effectiveness of the kernel.

Referring again to FIG. 1, in the MLMI framework each shot 112/128 can generally be considered an L-layer rooted tree (hereafter also referred to as a shot T) containing a connected acyclic directed graph of nodes n, where each node n is connected via a unique path to a root node located in the shot/root layer 100. As will be described in more detail hereafter, both the key-frames 104/106 in the key-frame sub-layer 102 and the key-regions 110/114/122/124/126 in the key-region sub-layer 108 can be considered leaf nodes in the tree. Each node n represents structured metadata of a certain granularity, where the granularity of the metadata increases as you progress down through each sub-layer 102/108 in the hierarchy. The exemplary MLMI kernel embodiment described hereafter models this structured metadata by sufficiently enumerating all the sub-structures contained within each layer. L is the maximum length of the unique path from the root node to the leaf nodes in the lowest sub-layer of the tree (i.e. the key-region sub-layer 108); or in other words, L is the total number of layers in the hierarchy. Thus, in the exemplary MLMI framework depicted in FIG. 1 and as described herein, L=3. However, it is noted that another VCD technique embodiment is also possible in which L=2 and the MLMI framework simply includes the shot/root layer 100 and the key-frame sub-layer 102.

Given an L-layer tree corresponding to a particular shot T, the set of nodes n contained within T can be given by the equation:

$$N = \{n_i\}_{i=1}^{|N|}, \qquad (2)$$

where |N| is the total number of nodes in T. If S is given to represent a tree set and $s_i$ is given to represent the set of node patterns whose parent node is $n_i$, $s_i$ can be given by the equation:

$$s_i = \{s \mid s \in S \wedge \mathrm{parent}(s) = n_i\} \in \mathrm{pow}(S), \qquad (3)$$

where pow(S) refers to the power set of S. Additionally, a bijection mapping of $n_i \to s_i$ can be denoted. For each node $n_i \in N$, a “node pattern” of $n_i$ can be defined to be all the metadata associated with $n_i$, where this metadata is composed of the following three elements: layer information $l_i$, low-level feature descriptor information $f_i$, and tree sets $s_i$ rooted at $n_i$. $f_i$ more particularly represents a set of low-level features in the video clip based on a plurality of various modalities described heretofore. The node pattern of node $n_i$ can then be given by the following triplet form equation:

$$\hat{n}_i = \langle l_i, f_i, s_i \rangle. \qquad (4)$$

T can thus be expanded to the following node pattern set:

$$\hat{N} = \{\hat{n}_i\}_{i=1}^{|N|}. \qquad (5)$$

FIGS. 2A, 2B and 2C illustrate diagrams of an exemplary embodiment, in simplified form, of an expanded node pattern set $\hat{N}$ and associated node pattern groups $G_l$ for the MLMI framework depicted in FIG. 1. More particularly, referring again to FIG. 1, FIG. 2A illustrates a node pattern group $G_1$ for the root layer of the tree (i.e. the shot layer 100, herein also referred to as the uppermost layer l=1). $G_1$ contains the root node $n_0$ which can be given by the equation $n_0: (l_0, f_0)$, along with two exemplary tree sets $s_0$ 200/202 which are rooted at parent node $n_0$. According to equation (4), the node pattern of root node $n_0$ can be given by the equation $\hat{n}_0 = \langle l_0, f_0, s_0 \rangle$. Nodes $n_1$ and $n_2$ correspond to key-frames 104 and 106 respectively. Nodes $n_3$, $n_4$, $n_5$, $n_6$ and $n_7$ correspond to key-regions 110, 114, 122, 124 and 126 respectively. FIG. 2B illustrates a node pattern group $G_2$ for the intermediate sub-layer of the tree (i.e. the key-frame sub-layer 102, herein also referred to as the intermediate layer l=2). $G_2$ contains the aforementioned nodes $n_1$-$n_7$ associated with the aforementioned two tree sets $s_0$. According to equation (4), the node pattern of node $n_1$ can be given by the equation $\hat{n}_1 = \langle l_1, f_1, s_1 \rangle$, and the node pattern of node $n_2$ can be given by the equation $\hat{n}_2 = \langle l_2, f_2, s_2 \rangle$. FIG. 2C illustrates a node pattern group $G_3$ for the lowest sub-layer of the tree (i.e. the key-region sub-layer 108, herein also referred to as the lowermost layer l=3). $G_3$ contains the aforementioned nodes $n_3$-$n_7$. Node $n_3$ can be given by the equation $\hat{n}_3 = \langle l_3, f_3, \emptyset \rangle$ and nodes $n_4$-$n_7$ can be given by similar equations (e.g., $\hat{n}_7 = \langle l_7, f_7, \emptyset \rangle$).
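One possible in-memory representation of the node pattern triplet of equation (4), and of the grouping of node patterns by layer, is sketched below in Python. The class name `NodePattern`, the helper `group_by_layer`, and the one-dimensional feature values are all hypothetical placeholders chosen for illustration; real feature vectors would come from the low-level descriptors extracted for each layer.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class NodePattern:
    """Triplet <l, f, s>: layer index, feature vector, child node patterns."""
    layer: int
    features: Tuple[float, ...]
    children: List["NodePattern"] = field(default_factory=list)


def group_by_layer(root: NodePattern) -> Dict[int, List[NodePattern]]:
    """Collect every node pattern in the tree into groups keyed by layer."""
    groups: Dict[int, List[NodePattern]] = {}
    stack = [root]
    while stack:
        node = stack.pop()
        groups.setdefault(node.layer, []).append(node)
        stack.extend(node.children)
    return groups


# Tree shaped like FIGS. 2A-2C: root n0, key-frames n1 and n2, key-regions n3..n7.
# Feature values are placeholders, not real descriptors.
n1 = NodePattern(2, (0.1,), [NodePattern(3, (0.3,)), NodePattern(3, (0.4,))])
n2 = NodePattern(2, (0.2,), [NodePattern(3, (0.5,)), NodePattern(3, (0.6,)),
                             NodePattern(3, (0.7,))])
n0 = NodePattern(1, (0.0,), [n1, n2])

groups = group_by_layer(n0)
print({layer: len(nodes) for layer, nodes in sorted(groups.items())})
# {1: 1, 2: 2, 3: 5}
```

Leaf node patterns simply carry an empty `children` list, which plays the role of the empty tree set in their triplets.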

A kernel of trees (herein also referred to as shots T or T′) can now be constructed using the expanded node pattern set given by equation (5). Based on the conventional definition of a convolution kernel, a relationship R can be constructed between an object and its parts. The kernel for the composite structured object can be defined based on the composition kernels of the parts of the object. First, $x, y \in X$ can be defined to be the objects and $\vec{x}, \vec{y} \in (X_1 \times \ldots \times X_D)$ can be defined to be tuples of parts of these objects, where D is the number of parts in each object. Given the relationship $R: (X_1 \times \ldots \times X_D) \times X$, x can then be decomposed as $R^{-1}(x) = \{\vec{x} : R(\vec{x}, x)\}$. Based on this decomposition, a convolution kernel $k_{conv}$ for comparing metadata structures x and y can be given by the following equation, with positive definite kernels on each part:

$$k_{conv}(x, y) = \sum_{\vec{x} \in R^{-1}(x),\ \vec{y} \in R^{-1}(y)} \ \prod_{d=1}^{D} k_d(x_d, y_d). \qquad (6)$$

For the node pattern set $\hat{N}$ defined in equation (5), R can be given by the set-membership equation $x \in R^{-1}(X) \Leftrightarrow \vec{x} \in X$ where D=1. An MLMI kernel $k_{MLMI}$ for comparing two different shots T and T′ in order to determine a degree of similarity (or difference) there-between can then be given by the equation:

$$k_{MLMI}(T, T') = \sum_{\hat{n} \in \hat{N},\ \hat{n}' \in \hat{N}'} k_{\hat{N}}(\hat{n}, \hat{n}'), \qquad (7)$$

where $\hat{n}$ is the node pattern of a particular node n in T and $\hat{n}'$ is the corresponding node pattern of corresponding particular node n′ in T′. Since $\hat{n}$ and $\hat{n}'$ are each composed of three elements as given by triplet form equation (4), $k_{\hat{N}}$ is a kernel on this triplet space. Using the tensor product operation ($K_1 \otimes K_2((x,u),(y,v)) = K_1(x,y) K_2(u,v)$), $k_{\hat{N}}$ can be given by the equation:

$$k_{\hat{N}}(\hat{n}, \hat{n}') = k_\delta(l_n, l_n') \times k_f(f_n, f_n') \times k_{st}(s_n, s_n'). \qquad (8)$$

Generally speaking, $k_\delta(x, y) = \delta_{x,y}$ is a matching kernel that represents the layer structure for metadata structures x and y. Thus, $k_\delta(l_n, l_n')$ in equation (8) is a matching kernel that represents the layer structure for $\hat{n}$ and $\hat{n}'$, since, as described heretofore, $l_n$ is the layer information for $\hat{n}$ and $l_n'$ is the layer information for $\hat{n}'$. $k_f$ is a feature-space kernel where $f_n$ is the low-level features in $\hat{n}$ and $f_n'$ is the low-level features in $\hat{n}'$. $k_{st}$ is a kernel of sub-trees where $s_n$ is the set of node patterns in T whose parent is n, and $s_n'$ is the set of node patterns in T′ whose parent is n′. By embedding a multi-instance data relationship into $s_n$ and $s_n'$, $k_{st}$ can be given by the equation:

$$k_{st}(s_n, s_n') = \max_{\hat{c} \in s_n,\ \hat{c}' \in s_n'} \{k_{\hat{N}}(\hat{c}, \hat{c}')\}, \qquad (9)$$

which indicates that the similarity of two different node patterns is affected by the most similar pairs of their sub-structures. However, since the max function of equation (9) is non-differentiable, equation (9) can be approximated by choosing a conventional radial basis function (RBF) kernel for $k_f$ in equation (8). As a result, $k_f$ can be given by the equation:

$$k_f(f_n, f_n') = \exp\left(-\left|f_n - f_n'\right|^2 / 2\sigma^2\right). \qquad (10)$$

Using the definition of $k_f$ given by equation (10), equation (9) above can then be approximated by the equation:

$$k_{st}(s_n, s_n') = \sum_{\hat{c} \in s_n,\ \hat{c}' \in s_n'} k_{\hat{N}}(\hat{c}, \hat{c}'), \qquad (11)$$

where $k_{st}$ is set to be 1 for leaf nodes (i.e. when $s_n, s_n' = \emptyset$).

Since the maximal layer of T is L, the nodes can be divided into L groups given by $\{G_l\}_{l=1}^{L}$. As a result, $\hat{N}$ can be transformed into a power set given by $\hat{N} = \{G_l\}_{l=1}^{L}$ where $G_l = \{\hat{n}_i \mid l_i = l\}$. Based upon the aforementioned matching kernel $k_\delta$, equation (7) can be rewritten as the equation:

$$k_{MLMI}(T, T') = \sum_{l=1}^{L} \ \sum_{\hat{n} \in G_l,\ \hat{n}' \in G_l'} k_{\hat{N}}(\hat{n}, \hat{n}'). \qquad (12)$$
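To make the recursion concrete, the following Python sketch evaluates the node pattern kernel of equation (8), with an RBF feature kernel and the summed sub-tree kernel of equation (11), and then accumulates the result over all node pattern pairs. It is a minimal sketch under stated assumptions: the names `Node`, `k_f`, `k_st`, `k_node`, `all_nodes` and `k_mlmi`, the default `sigma`, and the toy feature values are all hypothetical. The sub-tree kernel returns 1 only when both child sets are empty; when exactly one is empty, the sum over pairs is simply 0. No memoization is used, so the sketch favors clarity over speed.

```python
import math
from typing import Iterator, NamedTuple, Tuple


class Node(NamedTuple):
    layer: int
    features: Tuple[float, ...]
    children: Tuple["Node", ...] = ()


def k_f(f: Tuple[float, ...], g: Tuple[float, ...], sigma: float) -> float:
    """RBF feature kernel: exp(-|f - g|^2 / (2 sigma^2))."""
    sq = sum((a - b) ** 2 for a, b in zip(f, g))
    return math.exp(-sq / (2.0 * sigma ** 2))


def k_st(s: Tuple[Node, ...], t: Tuple[Node, ...], sigma: float) -> float:
    """Sub-tree kernel: 1 for two leaf nodes, else a sum over child pairs."""
    if not s and not t:
        return 1.0
    return sum(k_node(c, d, sigma) for c in s for d in t)


def k_node(n: Node, m: Node, sigma: float = 1.0) -> float:
    """Node pattern kernel: layer match x feature kernel x sub-tree kernel."""
    if n.layer != m.layer:  # matching kernel k_delta is zero across layers
        return 0.0
    return k_f(n.features, m.features, sigma) * k_st(n.children, m.children, sigma)


def all_nodes(root: Node) -> Iterator[Node]:
    yield root
    for child in root.children:
        yield from all_nodes(child)


def k_mlmi(t: Node, u: Node, sigma: float = 1.0) -> float:
    """Sum of node pattern kernels over all pairs; cross-layer pairs add 0."""
    return sum(k_node(n, m, sigma) for n in all_nodes(t) for m in all_nodes(u))


# Two toy two-layer shots with placeholder one-dimensional features.
shot_t = Node(1, (0.0,), (Node(2, (1.0,)),))
shot_u = Node(1, (0.0,), (Node(2, (3.0,)),))
print(k_mlmi(shot_t, shot_t))  # 2.0
print(k_mlmi(shot_t, shot_u))  # 2 * exp(-2)
```

Because the matching kernel zeroes every cross-layer pair, summing over all pairs gives the same value as summing layer by layer over the groups.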

$k_{MLMI}$ given by equation (12) can be shown to be positive definite as follows. As is appreciated in the art of kernel-based machine-learning, kernels are considered closed under basic operations such as sum ($K_1 + K_2$), direct sum ($K_1 \oplus K_2$), product ($K_1 \times K_2$) and tensor product ($K_1 \otimes K_2$). Since $k_{MLMI}$ given by equation (12) is completely constructed of these basic operations (i.e. the direct sum in equations (7) and (11), and the tensor product in equation (8)), it is closed and positive definite.

In order to avoid an undesirable scaling problem, a conventional feature-space normalization algorithm can then be applied to $k_{MLMI}$ given by equation (12). More particularly, $k_{MLMI}$ given by equation (12) can be normalized by the equation:

$$k_{MLMI}(T, T')_{NORM} = \frac{k_{MLMI}(T, T')}{\sqrt{k_{MLMI}(T, T)} \times \sqrt{k_{MLMI}(T', T')}}. \qquad (13)$$
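The effect of this normalization can be illustrated with a short Python sketch. The helper name `normalize` is an assumption, and a plain dot product (`dot`) stands in for the MLMI kernel purely to keep the example self-contained; any positive definite kernel could be passed in its place.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def dot(x: Vector, y: Vector) -> float:
    """Stand-in kernel for illustration only (not the MLMI kernel)."""
    return sum(p * q for p, q in zip(x, y))


def normalize(k: Callable[[Vector, Vector], float], a: Vector, b: Vector) -> float:
    """Feature-space normalization: k(a, b) / (sqrt(k(a, a)) * sqrt(k(b, b)))."""
    return k(a, b) / (math.sqrt(k(a, a)) * math.sqrt(k(b, b)))


# Rescaling one argument changes the raw kernel value but not the normalized one.
print(dot((1.0, 0.0), (2.0, 0.0)))             # 2.0
print(normalize(dot, (1.0, 0.0), (2.0, 0.0)))  # 1.0
print(normalize(dot, (1.0, 0.0), (0.0, 5.0)))  # 0.0
```

In the tree setting, the raw kernel value grows with the number of nodes in each shot, and dividing by the two self-similarities removes that dependence; an object compared with itself always normalizes to 1.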

From the MLMI kernel defined in equation (12), it is noted that the kernel k_(MLMI) between two shots T and T′ is the combination of kernels defined on node patterns of homogeneous layers, and these node pattern kernels are constructed based upon the intrinsic structure of the shots utilizing the rich context and multiple-instance relationships implicitly contained therein. Referring again to FIG. 1, it is further noted that this MLMI kernel is positive semi-definite when used to compare shots T and T′ (herein also referred to as trees) or their related sub-trees on both the same and different layers (e.g., shot 100 to shot, shot to key-frame 102, shot to key-region 108, and key-frame to key-region). It is further noted that equation (13) results in a number that indicates the relative degree of similarity between T and T′; the larger this number is, the more similar T and T′ are (a shot compared with itself yields 1).
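The normalization of equation (13) can be sketched in a few lines of Python. This is only a minimal illustration: the function names `normalize_kernel` and `normalize_gram` are hypothetical, and the un-normalized kernel values are assumed to have been computed already and stored in a square matrix whose diagonal holds each shot's self-similarity.

```python
import math

def normalize_kernel(k_ab, k_aa, k_bb):
    """Equation (13): k(T,T') / (sqrt(k(T,T)) * sqrt(k(T',T')))."""
    denom = math.sqrt(k_aa) * math.sqrt(k_bb)
    if denom == 0.0:
        raise ValueError("self-similarity values must be positive")
    return k_ab / denom

def normalize_gram(gram):
    """Apply equation (13) to every entry of a square matrix of kernel values."""
    n = len(gram)
    return [[normalize_kernel(gram[i][j], gram[i][i], gram[j][j])
             for j in range(n)] for i in range(n)]

# Toy un-normalized kernel values for two shots (made-up numbers).
raw = [[4.0, 2.0],
       [2.0, 9.0]]
normed = normalize_gram(raw)
```

After normalization every diagonal entry equals 1, so shots with many nodes no longer dominate purely because their un-normalized sums are larger.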

4.1 Generating MLMI Kernel

FIG. 3 illustrates an exemplary embodiment of a process for recursively generating, in a bottom-up manner, the MLMI kernel k_(MLMI)(T,T′) described heretofore which models the MLMI framework representation of a video clip by comparing prescribed pairs of shots T and T′ within the clip in order to determine a degree of similarity (or difference) there-between. The total number of layers for the MLMI framework used to represent the clip is assumed to be three (i.e. L=3). As depicted in FIG. 3, the process starts with inputting the two particular shots T and T′ 300. k_(MLMI)(T,T′) is then initialized to zero 302. A layer indicator l is then initialized to l=3 304 (i.e. the lowermost key-region sub-layer). Then, a check is performed to determine whether l=0 306. In this case, since l≠0 306 (i.e. currently l=3), for each node n ∈ G₃ and n′ ∈ G₃′ 308, if the nodes n and n′ are not on the lowermost leaf layer of G₃ 310, a kernel k_(N̂) for the node pattern set N̂ associated with nodes n and n′ is generated by solving the equation

${k_{\hat{N}}\left( {\hat{n},{\hat{n}}^{\prime}} \right)} = {{k_{f}\left( {f_{n},f_{n}^{\prime}} \right)} \times {\sum\limits_{{\hat{c} \in s_{n}},{{\hat{c}}^{\prime} \in s_{n^{\prime}}^{\prime}}}{k_{\hat{N}}\left( {\hat{c},{\hat{c}}^{\prime}} \right)}}}$ 314.

If on the other hand the nodes n and n′ are on the lowermost leaf layer of G₃ 310, a kernel k_(N̂) for the node pattern set N̂ associated with nodes n and n′ is generated by solving the equation k_(N̂)(n̂,n̂′)=k_(f)(f_(n),f_(n)′) 312. Then, k_(MLMI)(T,T′) is updated by solving the equations k_(MLMI)(T,T′)_(UD)=k_(MLMI)(T,T′)+k_(N̂)(n̂,n̂′), and then k_(MLMI)(T,T′)=k_(MLMI)(T,T′)_(UD) 316.

Referring again to FIG. 3, once k_(MLMI)(T,T′) has been generated for all the nodes n and n′ in G₃ and G₃′ 318, l is decremented by 1 320 such that l=2 (i.e. the intermediate key-frame sub-layer). Then, since l≠0 306 (i.e. currently l=2), for each node n ∈ G₂ and n′ ∈ G₂′ 308, if the nodes n and n′ are not on the lowermost leaf layer of G₂ 310, a kernel k_(N̂) for the node pattern set N̂ associated with nodes n and n′ is generated by solving the equation

${k_{\hat{N}}\left( {\hat{n},{\hat{n}}^{\prime}} \right)} = {{k_{f}\left( {f_{n},f_{n}^{\prime}} \right)} \times {\sum\limits_{{\hat{c} \in s_{n}},{{\hat{c}}^{\prime} \in s_{n^{\prime}}^{\prime}}}{k_{\hat{N}}\left( {\hat{c},{\hat{c}}^{\prime}} \right)}}}$ 314.

If on the other hand the nodes n and n′ are on the lowermost leaf layer of G₂ 310, a kernel k_(N̂) for the node pattern set N̂ associated with nodes n and n′ is generated by solving the equation k_(N̂)(n̂,n̂′)=k_(f)(f_(n),f_(n)′) 312. Then, k_(MLMI)(T,T′) is updated by solving the equations k_(MLMI)(T,T′)_(UD)=k_(MLMI)(T,T′)+k_(N̂)(n̂,n̂′), and then k_(MLMI)(T,T′)=k_(MLMI)(T,T′)_(UD) 316.

Referring again to FIG. 3, once k_(MLMI)(T,T′) has been generated for all the nodes n and n′ in G₂ and G₂′ 318, l is again decremented by 1 320 such that l=1 (i.e. the uppermost shot layer). Then, since l≠0 306 (i.e. currently l=1), for each node n ∈ G₁ and n′ ∈ G₁′ 308, if the nodes n and n′ are not on the lowermost leaf layer of G₁ 310, a kernel k_(N̂) for the node pattern set N̂ associated with nodes n and n′ is generated by solving the equation

${k_{\hat{N}}\left( {\hat{n},{\hat{n}}^{\prime}} \right)} = {{k_{f}\left( {f_{n},f_{n}^{\prime}} \right)} \times {\sum\limits_{{\hat{c} \in s_{n}},{{\hat{c}}^{\prime} \in s_{n^{\prime}}^{\prime}}}{k_{\hat{N}}\left( {\hat{c},{\hat{c}}^{\prime}} \right)}}}$ 314.

If on the other hand the nodes n and n′ are on the lowermost leaf layer of G₁ 310, a kernel k_(N̂) for the node pattern set N̂ associated with nodes n and n′ is generated by solving the equation k_(N̂)(n̂,n̂′)=k_(f)(f_(n),f_(n)′) 312. Then, k_(MLMI)(T,T′) is updated by solving the equations k_(MLMI)(T,T′)_(UD)=k_(MLMI)(T,T′)+k_(N̂)(n̂,n̂′), and then k_(MLMI)(T,T′)=k_(MLMI)(T,T′)_(UD) 316.

Referring again to FIG. 3, once k_(MLMI)(T,T′) has been generated for all the nodes n and n′ in G₁ and G₁′ 318, l is again decremented by 1 320 such that l=0 306. Then, k_(MLMI)(T,T′) can be normalized by solving equation (13) 322.
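The layer-by-layer procedure of FIG. 3 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: the `Node` class and all function names are hypothetical, the feature kernel `k_f` is assumed to be a Gaussian RBF with width `sigma` (equation (10) is not reproduced in this section), a node pair is treated as a leaf pair whenever either node has no children, and child kernels are recomputed recursively rather than cached.

```python
import math
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    feature: List[float]                       # low-level feature vector f_n
    children: List["Node"] = field(default_factory=list)

def k_f(a, b, sigma=1.0):
    # ASSUMPTION: Gaussian RBF feature kernel; the source's equation (10) may differ.
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def k_node(n, n2, sigma=1.0):
    base = k_f(n.feature, n2.feature, sigma)
    if not n.children or not n2.children:
        return base                            # leaf case (step 312)
    # non-leaf case (step 314): feature kernel times sum over all child pairs
    return base * sum(k_node(c, c2, sigma)
                      for c in n.children for c2 in n2.children)

def layers(root):
    """Group nodes by depth: [G_1, G_2, ..., G_L]."""
    groups, current = [], [root]
    while current:
        groups.append(current)
        current = [c for n in current for c in n.children]
    return groups

def k_mlmi(t, t2, sigma=1.0):
    g, g2 = layers(t), layers(t2)
    total = 0.0                                # step 302
    for l in range(min(len(g), len(g2)) - 1, -1, -1):   # lowermost layer first
        for n in g[l]:
            for n2 in g2[l]:
                total += k_node(n, n2, sigma)  # step 316
    return total

# Toy shot: one shot node with two identical key-frames and no key-regions.
shot = Node([0.0], [Node([0.0]), Node([0.0])])
chain = Node([0.0], [Node([0.0], [Node([0.0])])])
```

A production version would likely memoize `k_node` per node pair so that each pair is evaluated once, which is what makes the bottom-up ordering worthwhile.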

4.2 VCD Using SVM With MLMI Kernel (SVM-MLMIK Technique)

The exemplary MLMI kernel embodiment described heretofore can be combined with any appropriate supervised learning method such as the conventional Support Vector Machine (SVM) method in order to perform improved VCD on a video clip. This section provides a description of an exemplary embodiment of an SVM-MLMIK VCD technique which combines the aforementioned MLMI kernel with the SVM method. It is noted that the conventional SVM method can generally be considered a single-layer (SL) method. As such, the conventional SVM method is herein also referred to as the SVM-SL method.

Generally speaking, in the paradigm of structured learning/modeling, the goal is to learn an objective decision function f(x): X→Y from a structured input space X to response values in Y. Referring again to FIG. 1, in the exemplary MLMI framework embodiment described heretofore, determinate visual concepts are contained within the root layer (i.e. uppermost layer l=1 or shot layer 100), while the visual concepts and related low-level features contained within the sub-layers (i.e. intermediate layer l=2 (or key-frame sub-layer 102) and lowermost layer l=3 (or key-region sub-layer 108)) are unknown. As is appreciated in the art of kernel-based machine-learning, the SVM-SL method can be considered a maximum margin instance classifier in that it finds a separating hyper-plane that maximizes the separation (i.e. the margin) between positively labeled instances and negatively labeled instances.

Given J different training shots x_(i) segmented from a structured input space X, and related instance classification labels y_(i) for x_(i) which are given by the equation (x₁,y₁), . . . ,(x_(J),y_(J)) ∈ X×Y, where Y={−1,1}, once a structured metadata kernel model k_(x) is determined for X, the learning/modeling process can then be transformed to an SVM-based process as follows. The dual form of the objective decision function in the SVM-SL method can be given by the equation:

$\begin{matrix}{{\min\limits_{\alpha}\left\{ {{\frac{1}{2}\alpha^{T}Q\; \alpha} - {1^{T}\alpha}} \right\}}{{{{{s.t.\mspace{14mu} \alpha^{T}}y} = 0};{0 \leq \alpha \leq C}},}} & (14)\end{matrix}$

where Q is a Gram matrix given by the equation Q_(ij)=y_(i)y_(j)k_(x)(x_(i),x_(j)), k_(x)(x_(i),x_(j)) is the kernel model, 1 is a vector of all ones, α ∈ R^(J) is the vector of coefficients in the objective decision function, y ∈ R^(J) is an instance classification label vector, and C is a prescribed constant which controls the tradeoff between classification errors and the maximum margin. Parameters σ and C can be optimized using a conventional grid search method.
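To make the quantities in equation (14) concrete, the following Python sketch builds the matrix Q, evaluates the dual objective for a candidate α, and checks the two constraints. It does not solve the quadratic program; the function names and the toy scalar kernel are assumptions introduced only for illustration.

```python
def gram_q(xs, ys, kernel):
    """Q_ij = y_i * y_j * k_x(x_i, x_j)."""
    return [[yi * yj * kernel(xi, xj) for xj, yj in zip(xs, ys)]
            for xi, yi in zip(xs, ys)]

def dual_objective(alpha, q):
    """Value of (1/2) * alpha^T Q alpha - 1^T alpha."""
    n = len(alpha)
    quad = sum(alpha[i] * q[i][j] * alpha[j]
               for i in range(n) for j in range(n))
    return 0.5 * quad - sum(alpha)

def is_feasible(alpha, ys, c, tol=1e-9):
    """Check alpha^T y = 0 and 0 <= alpha_i <= C."""
    balanced = abs(sum(a * y for a, y in zip(alpha, ys))) <= tol
    boxed = all(-tol <= a <= c + tol for a in alpha)
    return balanced and boxed

# Toy data: scalar "shots" compared with a plain product kernel (illustrative only).
xs, ys = [1.0, 2.0], [1, -1]
q = gram_q(xs, ys, lambda a, b: a * b)
value = dual_objective([1.0, 1.0], q)
```

In practice a quadratic programming solver would search over feasible α for the minimum of this objective.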

Based on equation (14), an SVM-SL-based learned objective decision function f(x) can be given by the equation:

$\begin{matrix}{{{f(x)} = {{sign}\left( {{\sum\limits_{{i = 1},\ldots \mspace{14mu},J}{y_{i}\alpha_{i}{k_{x}\left( {x_{i},x} \right)}}} + b} \right)}},} & (15)\end{matrix}$

wherein b represents a bias coefficient. The learned objective decision function f(x) of equation (15) can then be improved by substituting the MLMI kernel of equation (12) for the kernel k_(x), resulting in an SVM-MLMIK learned objective decision function f′(x) which can be given by the equation:

$\begin{matrix}{{f^{\prime}(x)} = {{{sign}\left( {{\sum\limits_{{i = 1},\ldots \mspace{14mu},J}{y_{i}\alpha_{i}{k_{MLMI}\left( {x_{i},x} \right)}}} + b} \right)}.}} & (16)\end{matrix}$

In tested embodiments of the SVM-MLMIK technique, σ and C were set as follows. σ was specified to vary from 1 to 15 with a step size of 2, and C was specified to be the set of values {2⁻², 2⁻¹, . . . , 2⁵}.
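The parameter grid described above can be enumerated as in the following sketch, which reads the set of C values as the powers of two from 2⁻² through 2⁵. The `evaluate` callback is hypothetical: it stands for whatever validation score (for example, average precision on held-out shots) a practitioner chooses, and is not specified by the description.

```python
from itertools import product

sigmas = list(range(1, 16, 2))           # 1, 3, ..., 15
cs = [2.0 ** k for k in range(-2, 6)]    # 2^-2, 2^-1, ..., 2^5
grid = list(product(sigmas, cs))         # every (sigma, C) combination

def grid_search(evaluate):
    """Return the (sigma, C) pair with the highest score under `evaluate`."""
    return max(grid, key=lambda pair: evaluate(*pair))

# Stand-in scoring function used only to exercise the search.
best = grid_search(lambda sigma, c: -abs(sigma - 7) - abs(c - 4.0))
```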

4.3 VCD Process Using SVM-MLMIK Technique

FIG. 5 illustrates an exemplary embodiment, in simplified form, of an SVM-MLMIK technique-based process for classifying visual concepts contained within a video clip (herein also termed performing VCD on the clip) based upon a prescribed set of target concepts. As depicted in FIG. 5, the process starts with segmenting the clip into a plurality of shots 500. An MLMI structured metadata representation of each shot is then constructed 502, where one or more key-frames are first extracted from each shot 504, followed by each key-frame being segmented into a plurality of key-regions 506, followed by the key-regions for each key-frame being filtered to filter out those that are smaller than a prescribed size 508. A set of pre-generated trained models of the target concepts is validated 510 using a set of training shots which are selected from the plurality of shots. An MLMI kernel is recursively generated 512, where this kernel models the MLMI structured metadata representation of each shot by comparing prescribed pairs of shots. The MLMI kernel is then utilized to generate a learned objective decision function 514 corresponding to the SVM-MLMIK technique, where this decision function learns a classifier for determining if a particular shot (that is not in the set of training shots) contains instances of the target concepts.

5.0 Regularization Framework

As is appreciated in the art of kernel-based machine-learning, not all instances in a positive bag should necessarily be labeled as positive. Accordingly, different instances in the same bag should ideally have different contributions to the kernel. In support of this ideal, this section provides a description of exemplary embodiments of a regularization framework for MLMI learning/modeling which introduces explicit constraints into the learned objective decision function described heretofore, where these constraints serve to restrict instance classification in the sub-layers of the MLMI framework.

As is appreciated in the art of kernel-based machine-learning, instance classification precision primarily depends on the kernel that is employed in the instance classification process. Referring again to FIG. 1, it is noted that in the MLMI framework and related MLMI kernel described heretofore, coherence of the metadata structures in the different layers 100/102/108 is maintained in a multi-instance manner which aids in the kernel's ability to deal with incorrectly labeled noisy data and more effectively model the metadata structures within the aforementioned hyper-plane. However, the MLMI kernel also possesses the following limitation. Since the MLMI kernel determines instance classification labels for the shot hyper-bags 112/128 without knowing the instance classification labeling for the key-frame and key-region sub-layers 102/108, the only knowledge the MLMI kernel has of this sub-layer labeling is that which is implicitly given by the multi-instance data relationships in the shot layer 100. Similarly, since the MLMI kernel determines instance classification labels for the key-frame bags 104/106 without knowing the instance classification labeling for the key-region sub-layer 108, the only knowledge the MLMI kernel has of this sub-layer labeling is that which is implicitly given by the multi-instance data relationships in the key-frame sub-layer 102. As a result, the MLMI kernel cannot completely explore the concept instances in the sub-layers and therefore can only determine “weak” instance classification labels for the target concepts within each layer 100/102/108. This limitation results in an ambiguity in the instance classification labels that are determined for the different layers. Since this instance classification label ambiguity propagates through the sub-layers 102 and 108 in the hierarchy, this phenomenon is termed an “ambiguity propagation” limitation. As will be appreciated from the detailed description of the regularization framework embodiments which follow, the regularization framework addresses the ambiguity propagation limitation of the MLMI kernel by directly modeling the multi-instance data relationships across each of the layers in the MLMI framework, thus reducing the propagation of instance classification label ambiguity through the sub-layers 102 and 108.

FIG. 4 illustrates a diagram of exemplary embodiments, in simplified form, of the regularization framework. The left side of FIG. 4 depicts the exemplary MLMI framework representation of a video clip previously depicted in FIG. 1. As depicted in FIG. 4, and referring again to FIGS. 1 and 2A-2C, the regularization framework 400 introduces explicit constraints 402/404/406 into the learned objective decision function which serve to restrict instance classification in the sub-layers 102/108 of the MLMI framework. The gauge 408-415 to the left of each image in the MLMI framework represents the instance classification of the particular node that corresponds to the image (e.g., gauge 409 represents the instance classification of node n₁ that corresponds to key-frame image 104). A completely full gauge to the left of a particular image would represent a positive instance classification label for its corresponding node (i.e. a 100 percent match to the target concepts), and a completely empty gauge to the left of a particular image would represent a negative instance classification label for its corresponding node (i.e. a zero percent match to the target concepts). Gauge 416 represents a ground-truth for the target concepts and accordingly, this gauge is completely full. In the example provided in FIG. 4 the target concepts include an airplane. Thus, gauge 408 represents the shot layer 100 instance classification of node n₀ to the target concept of an airplane. Gauges 409 and 410 represent the key-frame sub-layer 102 instance classifications of nodes n₁ and n₂ respectively to this target concept. Gauges 411-415 represent the key-region sub-layer 108 instance classifications of nodes n₃ - n₇ respectively to this target concept. Gauge 417 represents the shot layer instance classification to this target concept. Gauge 418 represents the key-frame sub-layer instance classification to this target concept. Gauge 419 represents the key-region sub-layer instance classification to this target concept.

Referring again to FIG. 4, the regularization framework 400 generally introduces the following explicit constraints to the final stage minimization procedure 420, an example of which is given by equation (14), associated with the learned objective decision function. Constraint A 402 takes into consideration the ground-truth 416 for the target concepts and the instance classification labels for the shots in the shot layer 417. Constraint A serves to minimize the shot (heretofore also referred to as the tree or shot T) instance classification errors, thus maximizing the precision of the classifier on the shots. Constraint B 404 takes into consideration the ground-truth 416, the instance classification labels for the key-frames in the key-frame sub-layer 418 and the instance classification labels for the sets of filtered key-regions in the key-region sub-layer 419. Constraint B serves to minimize the key-frame sub-layer and key-region sub-layer instance classification errors, thus maximizing the precision of the classifier on the sub-layers compared to the ground-truth 416. Constraint C 406 takes into consideration the instance classification labels for the shots in the shot layer 417, the instance classification labels for the key-frames in the key-frame sub-layer 418 and the instance classification labels for the sets of filtered key-regions in the key-region sub-layer 419. Constraint C serves to minimize the inter-layer inconsistency penalty, which measures the consistency between the key-frame sub-layer instance classification labels, the key-region sub-layer instance classification labels, and the shot layer instance classification labels. Based on the multiple-instance relationships in the MLMI framework, the maximum of the sub-layer instance classification labels should be consistent with the shot layer instance classification labels. As will be described in more detail hereafter, the regularization framework 400 also serves to minimize the overall complexity of the classifier 422.

Consider a structured metadata set {(T_(i),y_(i))}_(i=1) ^(N¹), where each T_(i) is an L-layer rooted tree representing a training shot drawn from the structured input space, y_(i) ∈ R is the instance classification label for T_(i), and N¹ is the number of training shots that are employed. Generally speaking, the objective is to learn a function f(T) which maps the structured input space to response values in R. Without loss of generality, the overall set of nodes in the training set can be re-indexed sequentially, from the uppermost layer (i.e. the root layer) to the lowermost sub-layer, into the relationship equation:

$\begin{matrix}{\left\{ {T_{1},\ldots,T_{N^{1}},T_{11}^{2},\ldots,T_{N^{1}N_{N^{1}}^{2}}^{2},\ldots,T_{11}^{L},\ldots,T_{N^{1}N_{N^{1}}^{L}}^{L}} \right\},} & (17)\end{matrix}$

where T_(im) ^(l) is the m^(th) sub-structure in the l^(th) layer for shot T_(i), and N_(i) ^(l) is the number of sub-structures in the l^(th) layer for shot T_(i).

If H is given to be the Reproducing Kernel Hilbert Space (RKHS) of function f(T), and ∥f∥_(H) ² is given to be the RKHS norm of function f, the mathematical optimization problem in MLMI learning/modeling can be given by the equation:

$\begin{matrix}{\min\limits_{f \in H}\ {\frac{1}{2}{\| f \|}_{H}^{2}} + {\sum\limits_{i = 1}^{N^{1}}{V\left( {y_{i},{f\left( T_{i} \right)},\left\{ {f\left( T_{im}^{l} \right)} \right\}_{m,l}} \right)}},\quad m = 1,\ldots,N_{i}^{l};\ l = 2,\ldots,L.} & (18)\end{matrix}$

Referring again to FIG. 4, the first term in equation (18) is a regularization term which measures the complexity of the classifier 422, and the second term in equation (18) refers to the aforementioned constraints A 402, B 404 and C 406 that must be met by the learned function f. More particularly, constraint A 402 is a conventional Hinge loss function given by V₁(y_(i),f(T_(i)¹)) which measures the disagreement between the instance classification labels y_(i) for the ground-truth 416 and the root layer 100 (i.e. shot layer) classification f(T_(i)¹) of the shots 417. Constraint B 404 is a Hinge loss function given by Ṽ_(l)(y_(i), {f(T_(im) ^(l))}_(m)) which measures the disagreement between the instance classification labels y_(i) for the ground-truth 416, the key-frame sub-layer 102 classification {f(T_(im)²)}_(m) of the shots 418, and the key-region sub-layer 108 classification {f(T_(im)³)}_(m) of the shots 419. Constraint C 406 is an inter-layer inconsistency loss function given by V_(l)(f(T_(i)¹),{f(T_(im) ^(l))}_(m)) which measures the disagreement between the shot layer 100 classification f(T_(i)¹) of the shots 417, the key-frame sub-layer 102 classification {f(T_(im)²)}_(m) of the shots 418, and the key-region sub-layer 108 classification {f(T_(im)³)}_(m) of the shots 419.

The regularization framework described heretofore can be combined with the MLMI framework and related MLMI kernel described heretofore, resulting in an improved MLMI kernel which models the MLMI framework and the multi-instance data relationships that are contained therein in a more straightforward manner. As such, the regularization framework can be combined with the SVM-MLMIK technique described heretofore in order to maximize the instance classification precision when performing VCD on a video clip compared to the instance classification precision produced by the SVM-MLMIK technique without this regularization framework. It is noted that various different embodiments of the regularization framework are possible, where the different embodiments employ different combinations of constraints A, B and C and different loss functions for V. It is also noted that among these three constraints, constraint A is considered the most important since it is focused on classifying the shots. Constraints B and C are considered comparatively less important than constraint A since they are focused on restricting the instance classification function. Thus, constraint A is employed in all of the different embodiments of the regularization framework that will be described hereafter. It is also noted that the SVM-MLMIK technique described heretofore also employs constraint A. However, the SVM-MLMIK technique does not take advantage of constraints B and C. Three different particular embodiments of the regularization framework will now be described, one which employs constraints A and B, another which employs constraints A and C, and yet another which employs constraints A, B and C.

5.1 VCD Using SVM-MLMIK Technique With Regularization Framework Employing Constraints A and B (MLMI-FLCE Technique)

This section provides a description of an exemplary embodiment of a VCD technique which combines the SVM-MLMIK technique described heretofore with an embodiment of the regularization framework described heretofore that employs the combination of constraint A (which serves to minimize the shot layer instance classification errors) and constraint B (which serves to minimize the key-frame sub-layer and key-region sub-layer instance classification errors). This particular embodiment of the regularization framework employing constraints A and B is hereafter referred to as the full layers classification error (FLCE) approach, and this corresponding particular embodiment of the VCD technique is hereafter referred to as the MLMI-FLCE technique.

Referring again to FIG. 4, in the MLMI structured metadata framework described heretofore, the instance classification labels for each layer 100/102/108 should be consistent with the ground-truth 416. The combination of constraints A 402 and B 404 in the regularization framework 400 serves to penalize the instance classification errors over all the L layers 100/102/108 in the MLMI framework. In this particular embodiment, the loss function for each shot can be given by the equation:

$\begin{matrix}{{V\left( {y_{i},{f\left( T_{i} \right)},\left\{ {f\left( T_{im}^{l} \right)} \right\}_{m,l}} \right)} = {{\lambda_{1}{V_{1}\left( {y_{i},{f\left( T_{i} \right)}} \right)}} + {\sum\limits_{l = 2}^{L}{\lambda_{l}{{{\overset{\sim}{V}}_{l}\left( {y_{i},\left\{ {f\left( T_{im}^{l} \right)} \right\}_{m}} \right)}.}}}}} & (19)\end{matrix}$

As in the conventional SVM-SL method described heretofore, a conventional Hinge Loss function can be used to represent the shot layer classification errors by the equation V₁(y_(i),f(T_(i)))=max(0,1−y_(i)f(T_(i))). The classification errors for the sub-layers can be given by the equation:

$\begin{matrix}{{{{\overset{\sim}{V}}_{l}\left( {y_{i},\left\{ {f\left( T_{im}^{l} \right)} \right\}_{m}} \right)} = {\overset{\sim}{E}\left( {y_{i},{\max\limits_{m}\left\{ {f\left( T_{im}^{l} \right)} \right\}}} \right)}},} & (20)\end{matrix}$

where the max function is adopted to reflect the multi-instance data relationships in the MLMI framework. λ₁ and λ_(l) in equation (19) are prescribed regularization constants which determine a tradeoff between the shot layer classification errors and the key-frame and key-region sub-layer classification errors. Ẽ in equation (20) can be defined in a variety of different ways. If a weak restriction is used that penalizes only if a particular instance classification label and the ground-truth are inconsistent by sign, Ẽ in equation (20) can be given by the equation:

$\begin{matrix}{{\overset{\sim}{E}\left( {y_{i},{\max\limits_{m}\left\{ {f\left( T_{im}^{l} \right)} \right\}}} \right)} = {{\max \left( {0,{{- y_{i}} \times {\max\limits_{m}\left\{ {f\left( T_{im}^{l} \right)} \right\}}}} \right)}.}} & (21)\end{matrix}$

Finally, the learned objective decision function for the MLMI-FLCE technique can be given by the following quadratic concave-convex mathematical optimization equation:

$\begin{matrix}{{\min\limits_{f \in H}{\frac{1}{2}{\| f \|}_{H}^{2}}} + {\sum\limits_{i = 1}^{N^{1}}{\lambda_{1}{V_{1}\left( {y_{i},{f\left( T_{i} \right)}} \right)}}} + {\sum\limits_{i = 1}^{N^{1}}{\sum\limits_{l = 2}^{L}{\lambda_{l}{{\overset{\sim}{E}\left( {y_{i},{\max\limits_{m}\left\{ {f\left( T_{im}^{l} \right)} \right\}}} \right)}.}}}}} & (22)\end{matrix}$
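The per-shot FLCE loss of equations (19) through (21) can be evaluated as in the following sketch. The function and argument names are hypothetical, the classifier scores are taken as given numbers rather than produced by a trained model, and the example values are made up purely to show the arithmetic.

```python
def hinge(y, f_root):
    """Constraint A: V_1 = max(0, 1 - y * f(T_i))."""
    return max(0.0, 1.0 - y * f_root)

def weak_sublayer_error(y, sub_scores):
    """Equation (21): penalize only when y and the largest sub-layer score differ in sign."""
    return max(0.0, -y * max(sub_scores))

def flce_loss(y, f_root, scores_by_layer, lam1, lams):
    """Equation (19): lam1 * V_1 plus the weighted sub-layer errors for l = 2..L."""
    loss = lam1 * hinge(y, f_root)
    for lam, scores in zip(lams, scores_by_layer):
        loss += lam * weak_sublayer_error(y, scores)
    return loss

# Positive shot scored 0.5; key-frame scores, then key-region scores (illustrative values).
example = flce_loss(1, 0.5, [[-0.2, 0.3], [-0.4, -0.1]], 1.0, [1.0, 1.0])
```

In this example the key-frame layer contributes nothing because its largest score is positive, while the key-region layer is penalized because even its best region scores negative for a positive shot.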

5.2 VCD Using SVM-MLMIK Technique With Regularization Framework Employing Constraints A and C (MLMI-ILCC Technique)

This section provides a description of an exemplary embodiment of a VCD technique which combines the SVM-MLMIK technique described heretofore with an embodiment of the regularization framework described heretofore that employs the combination of aforementioned constraint A (which serves to minimize the shot layer instance classification errors) and constraint C (which serves to minimize the inter-layer inconsistency penalty). This particular embodiment of the regularization framework employing constraints A and C is hereafter referred to as the inter-layer consistency constraint (ILCC) approach, and this corresponding particular embodiment of the VCD technique is hereafter referred to as the MLMI-ILCC technique.

Referring again to FIG. 4, the use of constraint C 406 in the regularization framework 400 is motivated by the fact that the key-frame and key-region sub-layer instance classification labels 418/419 should be consistent with the shot layer instance classification labels 417. In this particular embodiment, the loss function for each shot can be given by the equation:

$\begin{matrix}{{V\left( {y_{i},{f\left( T_{i} \right)},\left\{ {f\left( T_{im}^{l} \right)} \right\}_{m,l}} \right)} = {{\lambda_{1}{V_{1}\left( {y_{i},{f\left( T_{i} \right)}} \right)}} + {\sum\limits_{l = 2}^{L}{\lambda_{l}{{V_{l}\left( {{f\left( T_{i} \right)},\left\{ {f\left( T_{im}^{l} \right)} \right\}_{m}} \right)}.}}}}} & (23)\end{matrix}$

The aforementioned conventional Hinge loss function can be used to represent the shot layer classification errors as described in the MLMI-FLCE technique heretofore. The loss function for the inter-layer inconsistency penalty can be given by the equation:

$\begin{matrix}{{V_{l}\left( {{f\left( T_{i} \right)},\left\{ {f\left( T_{im}^{l} \right)} \right\}_{m}} \right)} = {{E\left( {{f\left( T_{i} \right)},{\max\limits_{m}\left\{ {f\left( T_{im}^{l} \right)} \right\}}} \right)}.}} & (24)\end{matrix}$

λ₁ and λ_(l) in equation (23) are prescribed regularization constants which determine a tradeoff between the shot layer instance classification errors and inter-layer inconsistency penalty. Various different loss functions can be employed for V in equation (24), such as a conventional L1 loss function, a conventional L2 loss function, etc. In the technique embodiments described herein, an L1 loss function is employed for V. As is appreciated in the art of kernel-based machine-learning, the L1 loss function is defined by the equation E_(L1)(a,b)=|a−b|. As a result, E in equation (24) can be given by the equation:

$\begin{matrix}{{E\left( {{f\left( T_{i} \right)},{\max\limits_{m}\left\{ {f\left( T_{im}^{l} \right)} \right\}}} \right)} = {\left| {{f\left( T_{i} \right)} - {\max\limits_{m}\left\{ {f\left( T_{im}^{l} \right)} \right\}}} \right|.}} & (25)\end{matrix}$

Thus, the learned objective decision function for the MLMI-ILCC technique can be given by the following quadratic concave-convex mathematical optimization equation:

$\begin{matrix}{{\min\limits_{f \in H}{\frac{1}{2}{\| f \|}_{H}^{2}}} + {\sum\limits_{i = 1}^{N^{1}}{\lambda_{1}{V_{1}\left( {y_{i},{f\left( T_{i} \right)}} \right)}}} + {\sum\limits_{i = 1}^{N^{1}}{\sum\limits_{l = 2}^{L}{\lambda_{l}{{E\left( {{f\left( T_{i} \right)},{\max\limits_{m}\left\{ {f\left( T_{im}^{l} \right)} \right\}}} \right)}.}}}}} & (26)\end{matrix}$
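As a hedged illustration of equations (23) through (25), the sketch below computes the per-shot ILCC loss using the L1 inconsistency penalty. The name `ilcc_loss` and the sample scores are assumptions introduced here; the scores would normally come from the learned function f.

```python
def ilcc_loss(y, f_root, scores_by_layer, lam1, lams):
    """lam1 * hinge(y, f(T_i)) + sum over sub-layers of lam_l * |f(T_i) - max_m f(T_im^l)|."""
    loss = lam1 * max(0.0, 1.0 - y * f_root)           # constraint A (hinge loss)
    for lam, scores in zip(lams, scores_by_layer):
        loss += lam * abs(f_root - max(scores))        # constraint C, equation (25)
    return loss

# Shot scored 0.5; key-frame scores, then key-region scores (illustrative values).
example = ilcc_loss(1, 0.5, [[-0.2, 0.3], [-0.4, -0.1]], 1.0, [1.0, 1.0])

# A shot whose best sub-layer scores match its own score incurs no inconsistency penalty.
consistent = ilcc_loss(1, 1.5, [[1.5, 0.2], [1.5, -3.0]], 1.0, [1.0, 1.0])
```

Unlike the sub-layer error of constraint B, this penalty does not look at the label y at all; it only compares the shot score with the best score in each sub-layer.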

5.3 VCD Using SVM-MLMIK Technique With Regularization Framework Employing Constraints A, B and C (MLMI-FLCE-ILCC Technique)

This section provides a description of an exemplary embodiment of a VCD technique which combines the SVM-MLMIK technique described heretofore with an embodiment of the regularization framework described heretofore that employs the combination of constraint A (which serves to minimize the shot layer instance classification errors), constraint B (which serves to minimize the key-frame and key-region sub-layer instance classification errors), and constraint C (which serves to minimize the inter-layer inconsistency penalty). This particular embodiment of the regularization framework employing constraints A, B and C is hereafter referred to as the MLMI-FLCE-ILCC technique.

Based on the descriptions of the MLMI-FLCE and MLMI-ILCC techniques heretofore, the loss function for each shot can be given by the equation:

$\begin{matrix}{{V\left( {y_{i},{f\left( T_{i} \right)},\left\{ {f\left( T_{im}^{l} \right)} \right\}_{m,l}} \right)} = {{\lambda_{1}{V_{1}\left( {y_{i},{f\left( T_{i} \right)}} \right)}} + {\sum\limits_{l = 2}^{L}{{\overset{\sim}{\lambda}}_{l}{{\overset{\sim}{V}}_{l}\left( {y_{i},\left\{ {f\left( T_{im}^{l} \right)} \right\}_{m}} \right)}}} + {\sum\limits_{l = 2}^{L}\; {\lambda_{l}{{V_{l}\left( {{f\left( T_{i} \right)},\left\{ {f\left( T_{im}^{l} \right)} \right\}_{m}} \right)}.}}}}} & (27)\end{matrix}$

λ₁, λ̃_(l) and λ_(l) in equation (27) are prescribed regularization constants which determine a tradeoff between the constraints A, B and C. Using the equations for V₁, Ṽ_(l) and V_(l) provided heretofore, and assuming an L1 loss function is employed for V as described heretofore, the learned objective decision function for the MLMI-FLCE-ILCC technique can be given by the following quadratic concave-convex mathematical optimization equation:

$\begin{matrix}{{\min\limits_{f \in H}{\frac{1}{2}{\| f \|}_{H}^{2}}} + {\sum\limits_{i = 1}^{N^{1}}{\lambda_{1}{V_{1}\left( {y_{i},{f\left( T_{i} \right)}} \right)}}} + {\sum\limits_{i = 1}^{N^{1}}\; {\sum\limits_{l = 2}^{L}{{\overset{\sim}{\lambda}}_{l}{\overset{\sim}{E}\left( {y_{i},{\max\limits_{m}\left\{ {f\left( T_{im}^{l} \right)} \right\}}} \right)}}}} + {\sum\limits_{i = 1}^{N^{1}}{\sum\limits_{l = 2}^{L}{\lambda_{l}{{E\left( {{f\left( T_{i} \right)},{\max\limits_{m}\left\{ {f\left( T_{im}^{l} \right)} \right\}}} \right)}.}}}}} & (28)\end{matrix}$
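The combined per-shot loss of equation (27) can be sketched as a single function with two separate weight lists, one for constraint B and one for constraint C. The names `combined_loss`, `lams_b` and `lams_c` are hypothetical, and the weak sign-based error and the L1 penalty are assumed for the two sub-layer terms.

```python
def combined_loss(y, f_root, scores_by_layer, lam1, lams_b, lams_c):
    """Hinge loss (A) + weighted sub-layer errors (B) + weighted inconsistency penalties (C)."""
    loss = lam1 * max(0.0, 1.0 - y * f_root)
    for lam_b, lam_c, scores in zip(lams_b, lams_c, scores_by_layer):
        best = max(scores)                          # multi-instance max over the layer
        loss += lam_b * max(0.0, -y * best)         # constraint B, weak restriction
        loss += lam_c * abs(f_root - best)          # constraint C, L1 loss
    return loss

scores = [[-0.2, 0.3], [-0.4, -0.1]]                # key-frame, key-region (illustrative)
all_three = combined_loss(1, 0.5, scores, 1.0, [1.0, 1.0], [1.0, 1.0])
only_a_b = combined_loss(1, 0.5, scores, 1.0, [1.0, 1.0], [0.0, 0.0])
only_a_c = combined_loss(1, 0.5, scores, 1.0, [0.0, 0.0], [1.0, 1.0])
```

Setting one weight list to zeros reduces the function to the two-constraint variants, which makes it convenient for comparing the three embodiments with a single code path.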

6.0 Optimization Using CCCP

This section generally describes the use of a conventional constrained concave-convex quadratic programming (CCCP) method for practically solving the three different mathematical optimization problems given by equations (22), (26) and (28).

6.1 MLMI-ILCC Technique Using L1 Loss (Constraints A and C)

This section generally describes an embodiment of how the aforementionedconventional CCCP method can be used to practically solve themathematical optimization problem given by equation (26) for theMLMI-ILCC technique described heretofore. More particularly, byintroducing slack variables, equation (26) can be rewritten as thefollowing constrained minimization equation:

$\begin{matrix}{{{\min\limits_{{f \in H},\delta_{1},\delta_{l},\delta_{l}^{*},{l = 2},\ldots \mspace{11mu},L}{\frac{1}{2}{f}_{H}^{2}}} + {\lambda_{1}\delta_{1}^{T}1} + {\sum\limits_{l = 1}^{L}{\lambda_{l}\left( {{\delta_{l}^{T}1} + {\delta_{l}^{*T}1}} \right)}}}{s.t.\left\{ {\begin{matrix}{{{1 - {y_{i}{f\left( T_{i} \right)}}} \leq \delta_{1i}},} & {{i = 1},\ldots \mspace{11mu},N^{1}} \\{{{{f\left( T_{i} \right)} - {\max\limits_{{m = 1},\ldots \mspace{11mu},N^{l}}\left\{ {f\left( T_{im}^{l} \right)} \right\}}} \leq \delta_{li}},} & {{i = 1},\ldots \mspace{11mu},N^{1},{l = 2},{\ldots \mspace{11mu} L}} \\{{{{\max\limits_{{m = 1},\ldots \mspace{11mu},N^{l}}\left\{ {f\left( T_{im}^{l} \right)} \right\}} - {f\left( T_{i} \right)}} \leq \delta_{li}^{*}},} & {{i = 1},\ldots \mspace{11mu},N^{1},{l = 2},{\ldots \mspace{11mu} L}} \\{{0 \leq \delta_{1i}},{0 \leq \delta_{li}},{0 \leq \delta_{li}^{*}},} & {{i = 1},\ldots \mspace{11mu},N^{1},{l = 2},{\ldots \mspace{11mu} L}}\end{matrix},} \right.}} & (29)\end{matrix}$

where δ₁=[δ₁₁,δ₁₂, . . . ,δ_(1N¹)]^(T) is a vector of slack variables for the shot layer classification errors, δ_(l)=[δ_(l1),δ_(l2), . . . ,δ_(lN¹)]^(T) and δ_(l)*=[δ_(l1)*,δ_(l2)*, . . . ,δ_(lN¹)*]^(T) are vectors of slack variables for the classification inconsistency between the l^(th) layer and the root/shot layer, and 1=[1,1, . . . ,1]^(T) is a vector of all ones.
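By way of illustration only, the following Python sketch computes, for a single training shot and a fixed f, the smallest slack values that satisfy the constraints of equation (29). The function name `slack_values` and the dictionary layout of the sub-layer responses are assumptions made for this sketch and are not part of the described embodiments.

```python
from typing import Dict, List, Tuple


def slack_values(
    y: int, f_root: float, f_sub: Dict[int, List[float]]
) -> Tuple[float, Dict[int, float], Dict[int, float]]:
    """Smallest slacks satisfying the constraints of equation (29) for one shot.

    y      : label of the shot, +1 or -1
    f_root : f(T_i), the response of the shot (root) node
    f_sub  : maps a layer index l >= 2 to the responses f(T_im^l) in that layer
             (hypothetical container chosen for this sketch)
    """
    delta_1 = max(0.0, 1.0 - y * f_root)            # shot-layer hinge error
    delta: Dict[int, float] = {}
    delta_star: Dict[int, float] = {}
    for layer, responses in f_sub.items():
        top = max(responses)                        # max over m of f(T_im^l)
        delta[layer] = max(0.0, f_root - top)       # root exceeds sub-layer max
        delta_star[layer] = max(0.0, top - f_root)  # sub-layer max exceeds root
    return delta_1, delta, delta_star
```

For any layer, at most one of the two inconsistency slacks is non-zero, and their sum equals the absolute difference between the root response and the maximal sub-layer response.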

Now define f to be a linear function in the mapped high-dimensional space, f(X)=W^(T)φ(X)+b, where φ(·) is the mapping function. By ignoring b in ∥f∥_(H)² (as is done in the conventional SVM-SL method described heretofore) and substituting f into equation (29), equation (29) becomes:

$\begin{matrix}{{{\min\limits_{W,b,\delta_{1},\delta_{l},\delta_{l}^{*},{l = 2},\ldots \mspace{11mu},L}{\frac{1}{2}W^{T}W}} + {\lambda_{1}\delta_{1}^{T}1} + {\sum\limits_{l = 2}^{L}\; {\lambda_{l}\left( {{\delta_{l}^{T}1} + {\delta_{l}^{*T}1}} \right)}}}{s.t.\left\{ {\begin{matrix}{{{1 - {y_{i}\left( {{W^{T}{\varphi \left( T_{i} \right)}} + b} \right)}} \leq \delta_{1i}},} & {{i = 1},\ldots \mspace{11mu},N^{1}} \\{{{{W^{T}{\varphi \left( T_{i} \right)}} - {\max\limits_{{m = 1},\ldots \mspace{11mu},N^{l}}\left\{ {W^{T}{\varphi \left( T_{im}^{l} \right)}} \right\}}} \leq \delta_{li}},} & {{i = 1},\ldots \mspace{11mu},N^{1},{l = 2},{\ldots \mspace{11mu} L}} \\{{{{\max\limits_{{m = 1},\ldots \mspace{11mu},N^{l}}\left\{ {W^{T}{\varphi \left( T_{im}^{l} \right)}} \right\}} - {W^{T}{\varphi \left( T_{i} \right)}}} \leq \delta_{li}^{*}},} & {{i = 1},\ldots \mspace{11mu},N^{1},{l = 2},{\ldots \mspace{11mu} L}} \\{{0 \leq \delta_{1i}},{0 \leq \delta_{li}},{0 \leq \delta_{li}^{*}},} & {{i = 1},\ldots \mspace{11mu},N^{1},{l = 2},{\ldots \mspace{11mu} L}}\end{matrix}.} \right.}} & (30)\end{matrix}$

It is noted that the second and third constraints in equation (30) are non-linear concave-convex inequalities, and all the other constraints are linear. Therefore, the CCCP method is well suited to solving the mathematical optimization problem in equation (30). By employing the following sub-gradient of the max function in equation (30):

$\begin{matrix}{{{\partial\left( {\max\limits_{{m = 1},\ldots \mspace{11mu},N^{l}}\left\{ {W^{T}{\varphi \left( T_{im}^{l} \right)}} \right\}} \right)} = {\sum\limits_{m = 1}^{N^{l}}{\beta_{im}^{l}{\varphi \left( T_{im}^{l} \right)}}}},} & (31)\end{matrix}$

where

$\begin{matrix}{\beta_{im}^{l} = \left\{ \begin{matrix}0 & {{{{if}\mspace{14mu} W^{T}{\varphi \left( T_{im}^{l} \right)}} \neq {\max\limits_{r}\left\{ {W^{T}{\varphi \left( T_{ir}^{l} \right)}} \right\}}},} \\{1/R} & {otherwise}\end{matrix} \right.} & (32)\end{matrix}$

and where R is the number of sub-structures with maximal response, equation (30) can be solved in an iterative fashion by fixing W and β in turn until W converges. More particularly, when fixing W, equation (32) is solved, and then when fixing β, the following equation is solved:

$\begin{matrix}{{{\min\limits_{W,b,\delta_{1},\delta_{l},\delta_{l}^{*},{l = 2},\ldots \mspace{11mu},L}{\frac{1}{2}W^{T}W}} + {\lambda_{1}\delta_{1}^{T}1} + {\sum\limits_{l = 2}^{L}\; {\lambda_{l}\left( {{\delta_{l}^{T}1} + {\delta_{l}^{*T}1}} \right)}}}{s.t.\left\{ {\begin{matrix}{{{1 - {y_{i}\left( {{W^{T}{\varphi \left( T_{i} \right)}} + b} \right)}} \leq \delta_{1i}},} & {{i = 1},\ldots \mspace{11mu},N^{1}} \\{{{{W^{T}{\varphi \left( T_{i} \right)}} - {\sum\limits_{m = 1}^{N^{l}}\; {\beta_{im}^{l}W^{T}{\varphi \left( T_{im}^{l} \right)}}}} \leq \delta_{li}},} & {{i = 1},\ldots \mspace{11mu},N^{1},{l = {2\ldots \; L}}} \\{{{{\sum\limits_{m = 1}^{N^{l}}\; {\beta_{im}^{l}W^{T}{\varphi \left( T_{im}^{l} \right)}}} - {W^{T}{\varphi \left( T_{i} \right)}}} \leq \delta_{li}^{*}},} & {{i = 1},\ldots \mspace{11mu},N^{1},{l = 2},{\ldots \mspace{11mu} L}} \\{{0 \leq \delta_{1i}},{0 \leq \delta_{li}},{0 \leq \delta_{li}^{*}},} & {{i = 1},\ldots \mspace{11mu},N^{1},{l = 2},{\ldots \mspace{11mu} L}}\end{matrix}.} \right.}} & (33)\end{matrix}$

However, equation (33) cannot be solved directly since W lies in the mapped feature space, which is usually infinite-dimensional. In order to address this issue, the explicit usage of W can be removed by forming a dual mathematical optimization problem as follows. Introducing the following Lagrange multiplier coefficients:

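$\begin{matrix}{\alpha = \left\lbrack {\alpha_{1}^{1},\ldots \mspace{11mu},\alpha_{N^{1}}^{1},\alpha_{1}^{2},\ldots \mspace{11mu},\alpha_{N^{1}}^{2},\alpha_{1}^{*2},\ldots \mspace{11mu},\alpha_{N^{1}}^{*2},\ldots \mspace{11mu},\alpha_{1}^{L},\ldots \mspace{11mu},\alpha_{N^{1}}^{L},\alpha_{1}^{*L},\ldots \mspace{11mu},\alpha_{N^{1}}^{*L}} \right\rbrack^{T}} & (34)\end{matrix}$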

into the constraints A and C results in the following dual formulation equation according to the conventional Karush-Kuhn-Tucker (KKT) theorem:

$\begin{matrix}{{{\min\limits_{\alpha}{\frac{1}{2}\alpha^{T}Q\; \alpha}} + {p^{T}\alpha}}{s.t.\left\{ {{\begin{matrix}{{Y^{T}\alpha} = 0} \\{0 \leq \alpha \leq \Lambda}\end{matrix}{and}\mspace{14mu} {the}\mspace{20mu} {equality}\mspace{14mu} W} = {\sum\limits_{i = 1}^{}\; {\left( {A\; \alpha} \right)_{i} \times \varphi }}} \right.}} & (35)\end{matrix}$

where α, p, Y, and Λ are (2L−1)×N¹ dimensional vectors, and p, Y, and Λ have entries given by the equations:

$\begin{matrix}{p_{i} = \left\{ {\begin{matrix}{- 1} & {1 \leq i \leq N^{1}} \\0 & {otherwise}\end{matrix},{Y_{i} = \left\{ {\begin{matrix}y_{i} & {1 \leq i \leq N^{1}} \\0 & {otherwise}\end{matrix},{and}} \right.}} \right.} & (36) \\{\Lambda_{i} = \left\{ {\begin{matrix}\lambda_{1} & {1 \leq i \leq N^{1}} \\\lambda_{l} & {{{{{\left( {{2l} - 3} \right)N^{1}} + 1} \leq i \leq {\left( {{2l} - 1} \right)N^{1}}};{l = 2}},\ldots \mspace{11mu},L}\end{matrix}.} \right.} & (37)\end{matrix}$
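By way of illustration only, the vectors p, Y and Λ of equations (36) and (37) can be assembled as in the following Python sketch. The function name `ilcc_dual_vectors` and its argument layout are assumptions made for this sketch; the code uses 0-based list positions, whereas the equations use 1-based indices.

```python
from typing import List, Sequence, Tuple


def ilcc_dual_vectors(
    labels: Sequence[int], lam_1: float, lam_layers: Sequence[float]
) -> Tuple[List[float], List[float], List[float]]:
    """Build p, Y and Lambda of equations (36) and (37).

    labels     : shot labels y_1..y_{N^1}, each +1 or -1
    lam_1      : regularization constant lambda_1 for the shot layer
    lam_layers : [lambda_2, ..., lambda_L] for the sub-layers
    """
    n = len(labels)                 # N^1, the number of training shots
    num_layers = len(lam_layers) + 1
    dim = (2 * num_layers - 1) * n  # dimension of alpha
    p = [-1.0] * n + [0.0] * (dim - n)
    y_vec = [float(y) for y in labels] + [0.0] * (dim - n)
    lam = [lam_1] * n
    for lam_l in lam_layers:        # layers l = 2..L
        lam += [lam_l] * (2 * n)    # one block for alpha^l, one for alpha^{*l}
    return p, y_vec, lam
```

Each sub-layer contributes 2×N¹ entries to Λ, which matches the index range from (2l−3)N¹+1 to (2l−1)N¹ in equation (37).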

In equation (35), Q=A^(T)KA is the Gram matrix with K being a kernel matrix and A being a sparse matrix whose number of rows is the overall number of nodes in the training set and whose number of columns is the dimension of α. Intuitively, A can be regarded as a multi-instance transform matrix that represents the inter-layer inconsistency penalty constraint C to the hyper-plane, where A is given by the equation:

$\begin{matrix}{A_{IJ} = \left\{ {\begin{matrix}y_{I} & {I,{J \in \left\lbrack {1,N^{1}} \right\rbrack}} \\{- 1} & {{I \in \left\lbrack {1,N^{1}} \right\rbrack},{{J \in \left\lbrack {{{\left( {{2l} - 3} \right)N^{1}} + 1},{\left( {{2l} - 2} \right)N^{1}}} \right\rbrack};{2 \leq l \leq L}}} \\1 & {{I \in \left\lbrack {1,N^{1}} \right\rbrack},{{J \in \left\lbrack {{{\left( {{2l} - 2} \right)N^{1}} + 1},{\left( {{2l} - 1} \right)N^{1}}} \right\rbrack};{2 \leq l \leq L}}} \\\beta_{I} & \begin{matrix}{{{I \notin \left\lbrack {1,N^{1}} \right\rbrack},{{J \in \left\lbrack {{{\left( {{2l} - 3} \right)N^{1}} + 1},{\left( {{2l} - 2} \right)N^{1}}} \right\rbrack};}}} \\{l\mspace{14mu} {is}\mspace{14mu} {the}\mspace{14mu} {layer}\mspace{14mu} {of}\mspace{14mu} {node}\mspace{14mu} {set}\mspace{14mu} (I)}\end{matrix} \\{- \beta_{I}} & \begin{matrix}{{{I \notin \left\lbrack {1,N^{1}} \right\rbrack},{{J \in \left\lbrack {{{\left( {{2l} - 2} \right)N^{1}} + 1},{\left( {{2l} - 1} \right)N^{1}}} \right\rbrack};}}} \\{l\mspace{14mu} {is}\mspace{14mu} {the}\mspace{14mu} {layer}\mspace{14mu} {of}\mspace{14mu} {node}\mspace{14mu} {set}\mspace{14mu} (I)}\end{matrix} \\0 & {otherwise}\end{matrix},} \right.} & (38)\end{matrix}$

where β_(I) is a prescribed coefficient for node set (I), and β_(I) corresponds to β_(im)^(l) in equation (31).

Eventually, equation (26) becomes a modified learned objective decision function given by the equation:

$\quad\begin{matrix}\begin{matrix}{{f(x)} = {{W^{T}{\varphi (x)}} + b}} \\{= {{\left( {\sum\limits_{i = 1}^{}\; {\left( {A\; \alpha} \right)_{i} \times {\varphi \left( {(i)} \right)}}} \right)^{T}{\varphi (x)}} + b}} \\{= {{{k^{T}(x)}A\; \alpha} + {b.}}}\end{matrix} & (39)\end{matrix}$

Then, in the same manner as described heretofore for the SVM-MLMIK technique, the modified learned objective decision function of equation (39) can be improved by substituting the MLMI kernel of equation (12) for the kernel k^(T)(x), resulting in an improved modified learned objective decision function f′(x) given by the equation:

f′(x)=k_(MLMI)(x_(i),x)Aα+b.   (40)
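By way of illustration only, once α and b have been obtained, a decision function of the form k^(T)Aα+b used in equations (39) and (40) can be evaluated as in the following Python sketch. The function name `decision_value` and the use of plain nested lists for A are assumptions made for this sketch; the kernel values between the training nodes and the shot x are assumed to have been computed beforehand.

```python
from typing import List, Sequence


def decision_value(
    k_x: Sequence[float],
    a_matrix: Sequence[Sequence[float]],
    alpha: Sequence[float],
    b: float,
) -> float:
    """Evaluate k(x)^T A alpha + b.

    k_x      : kernel values between each training node and the shot x
    a_matrix : multi-instance transform matrix A, one row per training node
    alpha    : dual coefficients, one per column of A
    b        : bias coefficient
    """
    # (A alpha)_i is the weight carried by training node i.
    a_alpha: List[float] = [
        sum(a_ij * alpha_j for a_ij, alpha_j in zip(row, alpha))
        for row in a_matrix
    ]
    return sum(k_i * w_i for k_i, w_i in zip(k_x, a_alpha)) + b
```

Forming Aα once and reusing it for every shot to be classified avoids repeating the matrix-vector product, since A and α are fixed after training.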

In tested embodiments of the MLMI-ILCC technique, λ₁ was specified to be the set of values {2⁻²,2⁻¹, . . . ,2⁵}. Additionally, λ₂=λ₃= . . . =λ_(L) were specified to be the set of values {10⁻³,10⁻²,10⁻¹,1}.
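By way of illustration only, the following Python sketch enumerates the parameter combinations implied by the value sets listed above. Treating the two sets as a grid of candidate values to be searched is an assumption made for this sketch, as are the variable names.

```python
from itertools import product

# Candidate values for lambda_1: 2^-2, 2^-1, ..., 2^5.
lambda_1_grid = [2.0 ** e for e in range(-2, 6)]

# Candidate values shared by lambda_2 = ... = lambda_L: 10^-3, ..., 10^0.
lambda_l_grid = [10.0 ** e for e in range(-3, 1)]

# Every (lambda_1, lambda_l) pair; each pair defines one training run.
grid = list(product(lambda_1_grid, lambda_l_grid))
```

Under this reading there are eight candidate values for λ₁ and four for the shared sub-layer constant, giving thirty-two combinations in total.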

6.2 MLMI-FLCE Technique Using L1 Loss (Constraints A and B)

This section generally describes an embodiment of how the aforementioned CCCP method can be used to practically solve the mathematical optimization problem given by equation (22) for the MLMI-FLCE technique described heretofore. An approach similar to that for the MLMI-ILCC technique just described is used to solve this optimization problem. More particularly, by introducing slack variables, equation (22) can be rewritten as the following constrained minimization equation:

$\begin{matrix}{{{\min\limits_{{f \in H},\delta_{1},\delta_{l},\delta_{l}^{*},{l = 2},\ldots \mspace{11mu},L}{\frac{1}{2}{f}_{H}^{2}}} + {\lambda_{1}\delta_{1}^{T}1} + {\sum\limits_{l = 2}^{L}\; {\lambda_{l}\delta_{l}^{T}1}}}{s.t.\left\{ {\begin{matrix}{{{1 - {y_{i}^{f}\left( T_{i} \right)}} \leq \delta_{1i}},} & {{i = 1},\ldots \mspace{11mu},N^{1}} \\{{{{- y_{i}} \times {\max\limits_{{m = 1},\ldots \mspace{11mu},N^{l}}\left\{ {f\left( T_{im}^{l} \right)} \right\}}} \leq \delta_{li}},} & {{i = 1},\ldots \mspace{11mu},N^{1},{l = 2},{\ldots \mspace{11mu} L}} \\{{0 \leq \delta_{1i}},{0 \leq \delta_{li}},} & {{i = 1},\ldots \mspace{11mu},N^{1},{l = 2},{\ldots \mspace{11mu} L}}\end{matrix},} \right.}} & (41)\end{matrix}$

where δ₁, δ_(l), and δ_(l)* are defined the same as for the MLMI-ILCC technique.

The CCCP method can be employed in the same iterative manner just described in detail above for the MLMI-ILCC technique in order to solve the mathematical optimization problem and eventually derive an improved modified learned objective decision function for the MLMI-FLCE technique similar to equation (40). However, it is noted that in this case the variables in equation (35) differ as follows compared to the definitions provided for the MLMI-ILCC technique. α, p, Y, and Λ are L×N¹ dimensional vectors with entries given by the equations:

$\begin{matrix}{{\alpha = \left\lbrack {\alpha_{1}^{1},\ldots \mspace{11mu},\alpha_{N^{1}}^{1},\alpha_{1}^{2},\ldots \mspace{11mu},\alpha_{N^{1}}^{2},\ldots \mspace{11mu},\alpha_{1}^{L},\ldots \mspace{11mu},\alpha_{N^{1}}^{L}} \right\rbrack^{T}},} & (42) \\{p_{i} = \left\{ {\begin{matrix}{- 1} & {1 \leq i \leq N^{1}} \\0 & {otherwise}\end{matrix},{Y_{i} = y_{j\% N^{1}}},} \right.} & (43) \\{\Lambda_{i} = \left\{ {\begin{matrix}\lambda_{1} & {1 \leq i \leq N^{1}} \\\lambda_{l} & {{{{{\left( {l - 1} \right) \times N^{1}} + 1} \leq i \leq {l \times N^{1}}};{l = 2}},\ldots \mspace{11mu},L}\end{matrix},} \right.} & (44)\end{matrix}$

and A is a multi-instance transform matrix given by the equation:

$\begin{matrix}{A_{IJ} = \left\{ \begin{matrix}y_{I} & {I = {J \in \left\lbrack {1,N^{1}} \right\rbrack}} \\{y_{J\mspace{11mu} \% \mspace{11mu} N^{1}} \times \beta_{I}} & \begin{matrix}{{I \notin \left\lbrack {1,N^{1}} \right\rbrack},{{J \in \left\lbrack {{{\left( {l - 1} \right) \times N^{1}} + 1},{l \times N^{1}}} \right\rbrack};}} \\{l{\mspace{11mu} \;}{is}\mspace{14mu} {the}\mspace{14mu} {layer}\mspace{14mu} {of}\mspace{14mu} {node}\mspace{14mu} {set}\mspace{14mu} {(I).}}\end{matrix} \\0 & {otherwise}\end{matrix} \right.} & (45)\end{matrix}$

In tested embodiments of the MLMI-FLCE technique, σ was specified to vary from 1 to 15 with a step size of 2, and λ₁ was specified to be the set of values {2⁻²,2⁻¹, . . . ,2⁵}. Additionally, λ₂=λ₃= . . . =λ_(L) were specified to be the set of values {10⁻³,10⁻²,10⁻¹,1}.

6.3 MLMI-FLCE-ILCC Technique Using L1 Loss (Constraints A, B and C)

This section generally describes an embodiment of how the aforementioned CCCP method can be used to practically solve the mathematical optimization problem given by equation (28) for the MLMI-FLCE-ILCC technique described heretofore. An approach similar to that for the MLMI-ILCC technique described heretofore is used to solve this optimization problem. More particularly, by introducing slack variables, equation (28) can be rewritten as the following constrained minimization equation:

$\begin{matrix}{{{\min\limits_{{f \in H},\delta_{1},\delta_{l},\delta_{l}^{*},{l = 2},\; \ldots \mspace{11mu},L}{\frac{1}{2}{f}_{H}^{2}}} + {\lambda_{1}\delta_{1}^{T}1} + {\sum\limits_{l = 2}^{L}\; {{\overset{\sim}{\lambda}}_{l}{\overset{\sim}{\delta}}_{l}^{T}1}} + {\sum\limits_{l = 2}^{L}\; {\lambda_{l}\left( {{\delta_{l}^{T}1} + {\delta_{l}^{*T}1}} \right)}}}{s.t.\left\{ {\begin{matrix}{{{1 - {y_{i}{f\left( T_{i} \right)}}} \leq \delta_{1i}},} & {{i = 1},\ldots \mspace{11mu},N^{1}} \\{{{{- y_{i}} \times {\max\limits_{{m = 1},\ldots \mspace{11mu},N^{l}}\left\{ {f\left( T_{im}^{l} \right)} \right\}}} \leq {\overset{\sim}{\delta}}_{li}},} & {{i = 1},\ldots \mspace{11mu},N^{1},{l = 2},{\ldots \mspace{11mu} L}} \\{{{{f\left( T_{i} \right)} - {\max\limits_{{m = 1},\ldots \mspace{11mu},N^{l}}\left\{ {f\left( T_{im}^{l} \right)} \right\}}} \leq \delta_{li}},} & {{i = 1},\ldots \mspace{11mu},N^{1},{l = 2},{\ldots \mspace{11mu} L}} \\{{{{\max\limits_{{m = 1},\ldots \mspace{11mu},N^{l}}\left\{ {f\left( T_{im}^{l} \right)} \right\}} - {f\left( T_{i} \right)}} \leq \delta_{li}^{*}},} & {{i = 1},\ldots \mspace{11mu},N^{1},{l = 2},{\ldots \mspace{11mu} L}} \\{{0 \leq \delta_{1i}},{0 \leq {\overset{\sim}{\delta}}_{li}},{0 \leq \delta_{li}},{0 \leq \delta_{li}^{*}},} & {{i = 1},\ldots \mspace{11mu},N^{1},{l = 2},{\ldots \mspace{11mu} L}}\end{matrix},} \right.}} & (46)\end{matrix}$

where δ₁, δ_(l), and δ_(l)* are defined the same as for the MLMI-ILCC technique.

The CCCP method can be employed in the same iterative manner described in detail heretofore for the MLMI-ILCC technique in order to solve the mathematical optimization problem and eventually derive an improved modified learned objective decision function for the MLMI-FLCE-ILCC technique similar to equation (40). However, it is noted that in this case the variables in equation (35) differ as follows compared to the definitions provided for the MLMI-ILCC technique. α, p, Y, and Λ are (3L−2)×N¹ dimensional vectors with entries given by the equations:

$\begin{matrix}{{\alpha = \left\lbrack \begin{matrix}{\alpha_{1}^{1},\ldots \mspace{11mu},\alpha_{N^{1}}^{1},{\overset{\sim}{\alpha}}_{1}^{2},\ldots \mspace{11mu},{\overset{\sim}{\alpha}}_{N^{1}}^{2},\alpha_{1}^{2},\ldots \mspace{11mu},\alpha_{N^{1}}^{2},\alpha_{1}^{*2},\ldots \mspace{11mu},\alpha_{N^{1}}^{*2},\ldots \mspace{11mu},} \\{{\overset{\sim}{\alpha}}_{1}^{L},\ldots \mspace{11mu},{\overset{\sim}{\alpha}}_{N^{1}}^{L},\alpha_{1}^{L},\ldots \mspace{11mu},\alpha_{N^{1}}^{L},\alpha_{1}^{*L},\ldots \mspace{11mu},\alpha_{N^{1}}^{*L}}\end{matrix} \right\rbrack^{T}},} & (47) \\{p_{i} = \left\{ {\begin{matrix}{- 1} & {1 \leq i \leq N^{1}} \\0 & {otherwise}\end{matrix},{Y_{i} = \left\{ {\begin{matrix}y_{i} & {1 \leq i \leq N^{1}} \\y_{i\mspace{11mu} \% \mspace{11mu} N^{1}} & {{N^{1} + 1} \leq i \leq {L \times N^{1}}} \\0 & {otherwise}\end{matrix},} \right.}} \right.} & (48) \\{\Lambda_{i} = \left\{ {\begin{matrix}\lambda_{1} & {1 \leq i \leq N^{1}} \\{\overset{\sim}{\lambda}}_{l} & {{{{{\left( {{3l} - 5} \right)N^{1}} + 1} \leq i \leq {\left( {{3l} - 4} \right)N^{1}}};{l = 2}},\ldots \mspace{11mu},L} \\\lambda_{l} & {{{{{\left( {{3l} - 4} \right)N^{1}} + 1} \leq i \leq {\left( {{3l} - 2} \right)N^{1}}};{l = 2}},\ldots \mspace{11mu},L}\end{matrix},} \right.} & (49)\end{matrix}$

and A is a multi-instance transform matrix given by the equation:

$\begin{matrix}{A_{IJ} = \left\{ \begin{matrix}y_{I} & {I,{J \in \left\lbrack {1,N^{1}} \right\rbrack}} \\{- 1} & {{I \in \left\lbrack {1,N^{1}} \right\rbrack},{{J \in \left\lbrack {{{\left( {{3l} - 4} \right)N^{1}} + 1},{\left( {{3l} - 3} \right)N^{1}}} \right\rbrack};{2 \leq l \leq L}}} \\1 & {{I \in \left\lbrack {1,N^{1}} \right\rbrack},{{J \in \left\lbrack {{{\left( {{3l} - 3} \right)N^{1}} + 1},{\left( {{3l} - 2} \right)N^{1}}} \right\rbrack};{2 \leq l \leq L}}} \\{y_{J\mspace{11mu} \% \mspace{11mu} N^{1}} \times \beta_{I}} & \begin{matrix}{{I \notin \left\lbrack {1,N^{1}} \right\rbrack},{{J \in \left\lbrack {{{\left( {{3l} - 5} \right)N^{1}} + 1},{\left( {{3l} - 4} \right)N^{1}}} \right\rbrack};}} \\{l\mspace{14mu} {is}\mspace{14mu} {the}\mspace{14mu} {layer}\mspace{14mu} {of}\mspace{14mu} {node}\mspace{14mu} {set}\mspace{14mu} {(I).}}\end{matrix} \\\beta_{I} & \begin{matrix}{{I \notin \left\lbrack {1,N^{1}} \right\rbrack},{{J \in \left\lbrack {{{\left( {{3l} - 4} \right)N^{1}} + 1},{\left( {{3l} - 3} \right)N^{1}}} \right\rbrack};}} \\{l\mspace{14mu} {is}\mspace{14mu} {the}\mspace{14mu} {layer}\mspace{14mu} {of}\mspace{14mu} {node}\mspace{14mu} {set}\mspace{14mu} (I)}\end{matrix} \\{- \beta_{I}} & \begin{matrix}{{I \notin \left\lbrack {1,N^{1}} \right\rbrack},{{J \in \left\lbrack {{{\left( {{3l} - 3} \right)N^{1}} + 1},{\left( {{3l} - 2} \right)N^{1}}} \right\rbrack};}} \\{l\mspace{14mu} {is}\mspace{14mu} {the}\mspace{14mu} {layer}\mspace{14mu} {of}\mspace{14mu} {node}\mspace{14mu} {set}\mspace{14mu} (I)}\end{matrix} \\0 & {otherwise}\end{matrix} \right.} & (50)\end{matrix}$

In tested embodiments of the MLMI-FLCE-ILCC technique, σ was specified to vary from 1 to 15 with a step size of 2, and λ₁ was specified to be the set of values {2⁻²,2⁻¹, . . . ,2⁵}. Additionally, λ₂=λ₃= . . . =λ_(L) and {tilde over (λ)}₂={tilde over (λ)}₃= . . . ={tilde over (λ)}_(L) were specified to be the set of values {10⁻³,10⁻²,10⁻¹,1}.

6.4 VCD Process Using Regularization Framework

FIG. 6 illustrates an exemplary embodiment, in simplified form, of a regularization framework-based process for performing VCD on a video clip (herein also termed classifying visual concepts contained within the clip) based upon a prescribed set of target concepts. As depicted in FIG. 6, the process starts with segmenting the clip into a plurality of shots 600. An MLMI structured metadata representation of each shot is then constructed 602. This representation includes a hierarchy of three layers. An uppermost shot layer contains the plurality of shots. An intermediate key-frame sub-layer is located contiguously beneath the shot layer and contains one or more key-frames that have been extracted from each shot 604. A lowermost key-region sub-layer is located contiguously beneath the key-frame sub-layer and contains a set of filtered key-regions for each key-frame 606/608. A set of pre-generated trained models of the target concepts is validated 610 using a set of training shots which are selected from the plurality of shots. An MLMI kernel is recursively generated 612, where this kernel models the MLMI structured metadata representation of each shot by comparing prescribed pairs of shots. A regularization framework is then utilized in conjunction with the MLMI kernel to generate modified learned objective decision functions 614 corresponding to either the MLMI-ILCC, MLMI-FLCE or MLMI-FLCE-ILCC techniques, where this decision function learns a classifier for determining if a particular shot, that is not in the set of training shots, contains instances of the target concepts.

7.0 Computing Environment

This section provides a brief, general description of a suitable computing system environment in which portions of the VCD technique embodiments described herein can be implemented. These VCD technique embodiments are operational with numerous general purpose or special purpose computing system environments or configurations. Exemplary well known computing systems, environments, and/or configurations that can be suitable include, but are not limited to, personal computers (PCs), server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the aforementioned systems or devices, and the like.

FIG. 7 illustrates a diagram of an exemplary embodiment, in simplified form, of a suitable computing system environment according to the VCD technique embodiments described herein. The environment illustrated in FIG. 7 is only one example of a suitable computing system environment and is not intended to suggest any limitation as to the scope of use or functionality of the VCD technique embodiments described herein. Neither should the computing system environment be interpreted as having any dependency or requirement relating to any one or combination of components exemplified in FIG. 7.

As illustrated in FIG. 7, an exemplary system for implementing the VCD technique embodiments described herein includes one or more computing devices, such as computing device 700. In its simplest configuration, computing device 700 typically includes at least one processing unit 702 and memory 704. Depending on the specific configuration and type of computing device, the memory 704 can be volatile (such as RAM), non-volatile (such as ROM and flash memory, among others) or some combination of the two. This simplest configuration is illustrated by dashed line 706.

As exemplified in FIG. 7, computing device 700 can also have additional features and functionality. By way of example, computing device 700 can include additional storage such as removable storage 708 and/or non-removable storage 710. This additional storage includes, but is not limited to, magnetic disks, optical disks and tape. Computer storage media typically embodies volatile and non-volatile media, as well as removable and non-removable media implemented in any method or technology. The computer storage media provides for storage of various information required to operate the device 700 such as computer readable instructions associated with an operating system, application programs and other program modules, and data structures, among other things. Memory 704, removable storage 708 and non-removable storage 710 are all examples of computer storage media. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage technology, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computing device 700. Any such computer storage media can be part of computing device 700.

As exemplified in FIG. 7, computing device 700 also includes a communications connection(s) 712 that allows the device to operate in a networked environment and communicate with a remote computing device(s), such as remote computing device(s) 718. Remote computing device(s) 718 can be a PC, a server, a router, a peer device, or other common network node, and typically includes many or all of the elements described herein relative to computing device 700. Communication between computing devices takes place over a network(s) 720, which provides a logical connection(s) between the computing devices. The logical connection(s) can include one or more different types of networks including, but not limited to, a local area network(s) (LAN) and wide area network(s) (WAN). Such networking environments are commonplace in conventional offices, enterprise-wide computer networks, intranets and the Internet. It will be appreciated that the communications connection(s) 712 and related network(s) 720 described herein are exemplary and other means of establishing communication between the computing devices can be used.

As exemplified in FIG. 7, communications connection(s) 712 and related network(s) 720 are an example of communication media. Communication media typically embodies computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, but not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, radio frequency (RF), infrared, frequency modulation (FM) radio and other wireless media. The term “computer-readable medium” as used herein includes both the aforementioned storage media and communication media.

As exemplified in FIG. 7, computing device 700 also includes an input device(s) 714 and output device(s) 716. Exemplary input devices 714 include, but are not limited to, a keyboard, mouse, pen, touch input device, microphone, and camera, among others. A user can enter commands and various types of information into the computing device 700 through the input device(s) 714. Exemplary output devices 716 include, but are not limited to, a display device(s), a printer, and audio output devices, among others. These input and output devices are well known and need not be described at length here.

The VCD technique embodiments described herein can be further described in the general context of computer-executable instructions, such as program modules, which are executed by computing device 700. Generally, program modules include routines, programs, objects, components, and data structures, among other things, that perform particular tasks or implement particular abstract data types. The VCD technique embodiments can also be practiced in a distributed computing environment where tasks are performed by one or more remote computing devices 718 that are linked through a communications network 712/720. In a distributed computing environment, program modules can be located in both local and remote computer storage media including, but not limited to, memory 704 and storage devices 708/710.

8.0 Additional Embodiments

While the VCD technique has been described in detail by specific reference to embodiments thereof, it is understood that variations and modifications thereof can be made without departing from the true spirit and scope of the technique. It is also noted that any or all of the aforementioned embodiments can be used in any combination desired to form additional hybrid embodiments. Although the VCD technique embodiments have been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described heretofore. Rather, the specific features and acts described heretofore are disclosed as example forms of implementing the claims.

1. A computer-implemented process for classifying visual concepts contained within a video clip based upon a prescribed set of target concepts, comprising process actions of: segmenting the clip into a plurality of shots, wherein each shot comprises a series of consecutive frames that represent a distinctive coherent visual theme; constructing a multi-layer multi-instance (MLMI) structured metadata representation of each shot; validating a set of pre-generated trained models of the target concepts using a set of training shots selected from the plurality of shots; recursively generating an MLMI kernel k_(MLMI)( ) which models the MLMI structured metadata representation of each shot by comparing prescribed pairs of shots; utilizing k_(MLMI)( ) to generate a learned objective decision function f( ) which learns a classifier for determining if a particular shot x, that is not in the set of training shots, comprises instances of the target concepts.
2. The process of claim 1, wherein the MLMI structured metadata representation of each shot comprises: a hierarchy of three layers; and a rooted tree structure, comprising a connected acyclic directed graph of nodes, wherein, each node n is connected via a unique path to a root node, each node n comprises structured metadata of a certain granularity describing a particular visual concept, the granularity of the metadata increases for each successive layer down the hierarchy, and each layer comprises a node pattern group G_(l) comprising all the nodes in the layer.
3. The process of claim 2, wherein the process action of constructing an MLMI structured metadata representation of each shot comprises actions of: extracting one or more key-frames from each shot, wherein each key-frame comprises one or more of the target concepts; segmenting each key-frame into a plurality of key-regions, wherein each key-region comprises a particular target concept; and filtering the plurality of key-regions for each key-frame to filter out those that are smaller than a prescribed size, thus creating a set of filtered key-regions for each key-frame.
4. The process of claim 3, wherein, the action of extracting one or more key-frames from each shot is performed using a TRECVID (Text REtrieval Conference (TREC) Video Retrieval Evaluation) organizer method, and the action of segmenting each key-frame into a plurality of key-regions is performed using a J-value Segmentation (JSEG) method.
5. The process of claim 3, wherein the MLMI structured metadata representation of each shot further comprises: a layer indicator l; an uppermost shot layer, l=1, comprising the root node and the plurality of shots segmented from the clip; an intermediate key-frame sub-layer, l=2, contiguously beneath the shot layer, comprising the one or more key-frames extracted from each shot; and a lowermost key-region sub-layer, l=3, contiguously beneath the key-frame sub-layer, comprising the set of filtered key-regions for each key-frame, wherein, a plurality of low-level feature descriptors f_(n) are prescribed for each node n within each layer which describe the visual concepts contained within the layer.
6. The process of claim 5, wherein, the feature descriptors f_(n) prescribed for nodes within the shot layer comprise camera motion, object motion and text, the feature descriptors f_(n) prescribed for nodes within the key-frame sub-layer comprise color histogram, color moment and texture, and the feature descriptors f_(n) prescribed for nodes within the key-region sub-layer comprise object shape, object size and object color.
7. The process of claim 5, wherein bag-instance correspondences exist both within each layer as well as between contiguous layers in the hierarchy.
 8. The process of claim 5, wherein the process action of recursively generating an MLMI kernel $k_{MLMI}(\,)$ which models the MLMI structured metadata representation of each shot comprises actions of: (a) inputting two particular shots $T$ and $T'$; (b) initializing $k_{MLMI}(T,T')$ to zero, wherein $k_{MLMI}(T,T')$ compares $T$ and $T'$ to determine a degree of similarity there-between; (c) initializing the layer indicator $l$ to three; (d) whenever $l$ is greater than zero, for each node $n$ in the node pattern group $G_{l}$ in $T$, and for each node $n'$ in the node pattern group $G_{l}'$ in $T'$, whenever $n$ or $n'$ are on a lowermost leaf layer of $G_{l}$ or $G_{l}'$ respectively, computing a kernel $k_{\hat{N}}(\hat{n},\hat{n}')$ comparing a node pattern $\hat{n}$ of $n$ and a node pattern $\hat{n}'$ of $n'$ as $k_{\hat{N}}(\hat{n},\hat{n}') = k_{f}(f_{n},f_{n}')$, wherein $f_{n}$ and $f_{n}'$ are the feature descriptors prescribed for $n$ and $n'$ respectively, $k_{f}(f_{n},f_{n}')$ is a feature-space kernel given by the equation $k_{f}(f_{n},f_{n}') = \exp\left(-|f_{n}-f_{n}'|^{2}/2\sigma^{2}\right)$, wherein $\sigma$ is a prescribed coefficient, and the node patterns $\hat{n}$ and $\hat{n}'$ comprise all the metadata associated with $n$ and $n'$ respectively, and whenever $n$ or $n'$ are not on the lowermost leaf layer of $G_{l}$ or $G_{l}'$ respectively, computing $k_{\hat{N}}(\hat{n},\hat{n}')$ as $k_{\hat{N}}(\hat{n},\hat{n}') = k_{f}(f_{n},f_{n}') \times \sum_{\hat{c} \in s_{n},\, \hat{c}' \in s_{n'}'} k_{\hat{N}}(\hat{c},\hat{c}')$, wherein $s_{n}$ and $s_{n'}'$ are sets of node patterns whose parent nodes are $n$ and $n'$ respectively; (e) updating $k_{MLMI}(T,T')$ by computing $k_{MLMI}(T,T')_{UD} = k_{MLMI}(T,T') + k_{\hat{N}}(\hat{n},\hat{n}')$ and then computing $k_{MLMI}(T,T') = k_{MLMI}(T,T')_{UD}$; (f) decrementing $l$ by one; and (g) repeating actions (d)-(f) until $l$ equals zero.
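The recursion recited in claim 8 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the names `Node`, `feature_kernel`, `node_kernel`, `layer_nodes` and `mlmi_kernel` are hypothetical, each node is assumed to carry a single numeric feature vector, the feature-space kernel is assumed to be the Gaussian form with a negative exponent, and a node is treated as a leaf when it has no children.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical tree node: a feature descriptor plus child nodes."""
    feature: List[float]
    children: List["Node"] = field(default_factory=list)


def feature_kernel(f: List[float], g: List[float], sigma: float = 1.0) -> float:
    # Assumed Gaussian feature-space kernel: exp(-|f - g|^2 / (2 sigma^2)).
    sq_dist = sum((a - b) ** 2 for a, b in zip(f, g))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def node_kernel(n: Node, m: Node, sigma: float = 1.0) -> float:
    # Leaf case: if either node has no children, compare features only.
    k_f = feature_kernel(n.feature, m.feature, sigma)
    if not n.children or not m.children:
        return k_f
    # Non-leaf case: feature kernel times the sum over all child pairs.
    return k_f * sum(node_kernel(c, d, sigma)
                     for c in n.children for d in m.children)


def layer_nodes(root: Node, layer: int) -> List[Node]:
    # Layer 1 is the root (shot); each further layer is one level down.
    nodes = [root]
    for _ in range(layer - 1):
        nodes = [c for n in nodes for c in n.children]
    return nodes


def mlmi_kernel(t: Node, u: Node, layers: int = 3, sigma: float = 1.0) -> float:
    # Steps (b)-(g): start at zero, walk l = 3, 2, 1, and accumulate the
    # node-pattern kernel over every node pair in the same layer.
    total = 0.0
    for layer in range(layers, 0, -1):
        for n in layer_nodes(t, layer):
            for m in layer_nodes(u, layer):
                total += node_kernel(n, m, sigma)
    return total
```

For two identical shots that each hold one key-frame with one key-region, every layer contributes a node-pattern kernel of one, so the sketch returns three.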
 9. The process of claim 8, further comprising an action of normalizing $k_{MLMI}(T,T')$ whenever $l$ equals zero, wherein the normalized $k_{MLMI}(T,T')$ is given by the equation $k_{MLMI}(T,T')_{NORM} = \frac{k_{MLMI}(T,T')}{\sqrt{k_{MLMI}(T,T)} \times \sqrt{k_{MLMI}(T',T')}}$.
 10. The process of claim 1, wherein, given a structured metadata input space $X$ comprising a set of $J$ training shots $x_{i}$, and related instance classification labels $y_{i}$ to the target concepts for $x_{i}$, wherein $x_{i}$ and $y_{i}$ are given by the equation $(x_{1},y_{1}), \ldots, (x_{J},y_{J}) \in X \times Y$, $Y = \{-1,1\}$, the learned objective decision function $f(\,)$ which learns a classifier for determining if a particular shot $x$ includes instances of the target concepts is given by the equation $f(x) = \mathrm{sign}\left(\sum_{i=1,\ldots,J} y_{i}\alpha_{i}k_{MLMI}(x_{i},x) + b\right)$, wherein $\alpha$ is a prescribed coefficient which is optimized using a grid search method, $k_{MLMI}(x_{i},x)$ is the MLMI kernel which compares the training shots $x_{i}$ to $x$, and $b$ is a prescribed bias coefficient.
 11. The process of claim 1, wherein the MLMI structured metadata representation of each shot comprises a hierarchy of two layers, said hierarchy comprising, an uppermost shot layer comprising the plurality of shots segmented from the clip, and a key-frame sub-layer, contiguously beneath the shot layer, comprising one or more key-frames for each shot, wherein each key-frame comprises one or more of the target concepts.
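The normalization of claim 9 and the decision function of claim 10 can be sketched together as follows. This is an illustrative sketch only: `normalized_kernel`, `decide` and `toy_kernel` are hypothetical names, the kernel is passed in as an arbitrary callable rather than the MLMI kernel itself, the coefficients `alphas` and bias `b` are assumed to have been obtained elsewhere, and a score of exactly zero is mapped to +1 by assumption.

```python
import math
from typing import Callable, Sequence, TypeVar

S = TypeVar("S")  # type of a shot representation (assumption: anything the kernel accepts)


def normalized_kernel(k: Callable[[S, S], float], t: S, u: S) -> float:
    # k(T, T') / (sqrt(k(T, T)) * sqrt(k(T', T')))
    return k(t, u) / (math.sqrt(k(t, t)) * math.sqrt(k(u, u)))


def decide(k: Callable[[S, S], float],
           train: Sequence[S],
           labels: Sequence[int],
           alphas: Sequence[float],
           b: float,
           x: S) -> int:
    # f(x) = sign(sum_i y_i * alpha_i * k(x_i, x) + b)
    score = sum(y * a * k(x_i, x)
                for x_i, y, a in zip(train, labels, alphas)) + b
    return 1 if score >= 0 else -1  # assumption: ties go to +1


def toy_kernel(a: float, b: float) -> float:
    # Hypothetical stand-in kernel on scalars, used only to exercise the code.
    return 2.0 * math.exp(-((a - b) ** 2) / 2.0)
```

With `toy_kernel`, `normalized_kernel(toy_kernel, t, t)` evaluates to one for any `t`, which is the property the normalization is meant to guarantee.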
 12. A computer-implemented process for performing video concept detection on a video clip based upon a prescribed set of target concepts, comprising process actions of: segmenting the clip into a plurality of shots, wherein each shot comprises a series of consecutive frames that represent a distinctive coherent visual theme; constructing a multi-layer multi-instance (MLMI) structured metadata representation of each shot, comprising, a layer indicator $l$, a hierarchy of three layers, said hierarchy comprising, an uppermost shot layer, $l=1$, comprising the plurality of shots segmented from the clip, an intermediate key-frame sub-layer, $l=2$, contiguously beneath the shot layer, comprising one or more key-frames for each shot, wherein each key-frame comprises one or more of the target concepts, and a lowermost key-region sub-layer, $l=3$, contiguously beneath the key-frame sub-layer, comprising a set of filtered key-regions for each key-frame, wherein each filtered key-region comprises a particular target concept, and a rooted tree structure, comprising a connected acyclic directed graph of nodes, wherein each node comprises structured metadata of a certain granularity describing a particular visual concept, and the granularity of the metadata increases for each successive layer down the hierarchy; validating a set of pre-generated trained models of the target concepts using a set of training shots selected from the plurality of shots; recursively generating an MLMI kernel $k_{MLMI}(\,)$ which models the MLMI structured metadata representation of each shot by comparing prescribed pairs of shots; and utilizing a regularization framework in conjunction with $k_{MLMI}(\,)$ to generate a modified learned objective decision function $f(\,)$ which learns a classifier for determining if a particular shot $x$, that is not in the set of training shots, comprises instances of the target concepts, wherein the regularization framework introduces explicit constraints which serve to restrict instance classification in the key-frame and key-region sub-layers, thus maximizing the precision of the classifier.
 13. The process of claim 12, wherein the explicit constraints introduced by the regularization framework comprise: a constraint A comprising a ground truth for the target concepts and instance classification labels for the plurality of shots in the shot layer, said constraint A serving to minimize instance classification errors for said shots; and a constraint B comprising the ground truth, instance classification labels for the key-frames in the key-frame sub-layer, and instance classification labels for the sets of filtered key-regions in the key-region sub-layer, said constraint B serving to minimize instance classification errors for said key-frames and said filtered key-regions.
 14. The process of claim 13, wherein, given a set of structured metadata comprising a set of training shots $x_{i}$ and related instance classification labels $y_{i}$ to the target concepts for $x_{i}$, wherein $x_{i}$ and $y_{i}$ are given by the equation $\{(x_{i},y_{i})\}_{i=1}^{N^{1}}$, and $N^{1}$ is the total number of training shots $x_{i}$, and the set of training shots $x_{i}$ comprises a set $I$ of sequentially indexed nodes given by the equation $\{x_{1}, \ldots, x_{N^{1}}, x^{2}_{11}, \ldots, x^{2}_{N^{1}N^{2}_{N^{1}}}, x^{3}_{11}, \ldots, x^{3}_{N^{1}N^{3}_{N^{1}}}\}$, wherein $x^{l}_{im}$ is an $m^{th}$ sub-structure in the $l^{th}$ layer for shot $x_{i}$, and $N^{l}_{i}$ is the number of sub-structures in the $l^{th}$ layer for shot $x_{i}$, the modified learned objective decision function $f(\,)$ for the particular shot $x$ is given by the equation $f(x) = k_{MLMI}(x_{i},x)A\alpha + b$, wherein $\alpha$ is a vector of prescribed Lagrange multiplier coefficients given by the equation $\alpha = [\alpha^{1}_{1}, \ldots, \alpha^{1}_{N^{1}}, \alpha^{2}_{1}, \ldots, \alpha^{2}_{N^{1}}, \alpha^{3}_{1}, \ldots, \alpha^{3}_{N^{1}}]^{T}$, $k_{MLMI}(x_{i},x)$ is the MLMI kernel which compares the training shots $x_{i}$ to $x$, $A$ is a multi-instance transform matrix given by the equation $A_{IJ} = \begin{cases} y_{I} & I = J \in [1,N^{1}] \\ y_{J \,\%\, N^{1}} \times \beta_{I} & I \notin [1,N^{1}],\ J \in [(l-1) \times N^{1}+1,\ l \times N^{1}];\ l \text{ is the layer of node set } (I) \\ 0 & \text{otherwise} \end{cases}$, $\beta_{I}$ is a prescribed coefficient for node set $(I)$ on layer $l$, and $b$ is a prescribed bias coefficient.
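Once the kernel values, the transform matrix and the multipliers are available, evaluating the modified decision function of claim 14 reduces to a row vector times a matrix times a column vector, plus the bias. The sketch below illustrates only that arithmetic. The function name `modified_decision_value` and the plain-list representation are assumptions, and how $A$, $\alpha$ and $b$ are populated is outside the scope of the sketch.

```python
from typing import Sequence


def modified_decision_value(k_row: Sequence[float],
                            A: Sequence[Sequence[float]],
                            alpha: Sequence[float],
                            b: float) -> float:
    """Evaluate f(x) = k A alpha + b.

    k_row : kernel values between x and each indexed training node
    A     : transform matrix with len(k_row) rows and len(alpha) columns
    alpha : multiplier coefficients
    b     : bias
    """
    if len(A) != len(k_row):
        raise ValueError("A must have one row per kernel value")
    if any(len(row) != len(alpha) for row in A):
        raise ValueError("each row of A must have one entry per multiplier")
    # First form the vector A @ alpha, then take its dot product with k_row.
    a_alpha = [sum(a_ij * a_j for a_ij, a_j in zip(row, alpha)) for row in A]
    return sum(k_i * v for k_i, v in zip(k_row, a_alpha)) + b
```

For example, with `k_row = [1.0, 2.0]`, `A = [[1, 0, 1], [0, 1, -1]]`, `alpha = [1, 1, 1]` and `b = 0.5`, the product `A alpha` is `[2, 0]` and the returned value is 2.5.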
 15. The process of claim 12, wherein the explicit constraints introduced by the regularization framework comprise: a constraint A comprising a ground truth for the target concepts and instance classification labels for the plurality of shots in the shot layer, said constraint A serving to minimize instance classification errors for said shots; and a constraint C comprising the instance classification labels for the plurality of shots in the shot layer, instance classification labels for the key-frames in the key-frame sub-layer, and instance classification labels for the sets of filtered key-regions in the key-region sub-layer, said constraint C serving to minimize an inter-layer inconsistency penalty which measures consistency between the instance classification labels for the plurality of shots, the instance classification labels for the key-frames and the instance classification labels for the sets of filtered key-regions.
 16. The process of claim 15, wherein, given a set of structured metadata comprising a set of training shots $x_{i}$ and related instance classification labels $y_{i}$ to the target concepts for $x_{i}$, wherein $x_{i}$ and $y_{i}$ are given by the equation $\{(x_{i},y_{i})\}_{i=1}^{N^{1}}$, and $N^{1}$ is the total number of training shots $x_{i}$, and the set of training shots $x_{i}$ comprises a set $I$ of sequentially indexed nodes given by the equation $\{x_{1}, \ldots, x_{N^{1}}, x^{2}_{11}, \ldots, x^{2}_{N^{1}N^{2}_{N^{1}}}, x^{3}_{11}, \ldots, x^{3}_{N^{1}N^{3}_{N^{1}}}\}$, wherein $x^{l}_{im}$ is an $m^{th}$ sub-structure in the $l^{th}$ layer for shot $x_{i}$, and $N^{l}_{i}$ is the number of sub-structures in the $l^{th}$ layer for shot $x_{i}$, the modified learned objective decision function $f(\,)$ for the particular shot $x$ is given by the equation $f(x) = k_{MLMI}(x_{i},x)A\alpha + b$, wherein $\alpha$ is a vector of prescribed Lagrange multiplier coefficients given by the equation $\alpha = [\alpha^{1}_{1}, \ldots, \alpha^{1}_{N^{1}}, \alpha^{2}_{1}, \ldots, \alpha^{2}_{N^{1}}, \alpha^{3}_{1}, \ldots, \alpha^{3}_{N^{1}}]^{T}$, $k_{MLMI}(x_{i},x)$ is the MLMI kernel which compares the training shots $x_{i}$ to $x$, $A$ is a multi-instance transform matrix given by the equation $A_{IJ} = \begin{cases} y_{I} & I, J \in [1,N^{1}] \\ -1 & I \in [1,N^{1}],\ J \in [(2l-3)N^{1}+1,\ (2l-2)N^{1}];\ 2 \leq l \leq 3 \\ 1 & I \in [1,N^{1}],\ J \in [(2l-2)N^{1}+1,\ (2l-1)N^{1}];\ 2 \leq l \leq 3 \\ \beta_{I} & I \notin [1,N^{1}],\ J \in [(2l-3)N^{1}+1,\ (2l-2)N^{1}];\ l \text{ is the layer of node set } (I) \\ -\beta_{I} & I \notin [1,N^{1}],\ J \in [(2l-2)N^{1}+1,\ (2l-1)N^{1}];\ l \text{ is the layer of node set } (I) \\ 0 & \text{otherwise} \end{cases}$, $\beta_{I}$ is a prescribed coefficient for node set $(I)$ on layer $l$, and $b$ is a prescribed bias coefficient.
 17. The process of claim 12, wherein the explicit constraints introduced by the regularization framework comprise: a constraint A comprising a ground truth for the target concepts and instance classification labels for the plurality of shots in the shot layer, said constraint A serving to minimize instance classification errors for said shots; a constraint B comprising the ground truth, instance classification labels for the key-frames in the key-frame sub-layer, and instance classification labels for the sets of filtered key-regions in the key-region sub-layer, said constraint B serving to minimize instance classification errors for said key-frames and said filtered key-regions; and a constraint C comprising the instance classification labels for the plurality of shots in the shot layer, the instance classification labels for the key-frames in the key-frame sub-layer, and the instance classification labels for the sets of filtered key-regions in the key-region sub-layer, said constraint C serving to minimize an inter-layer inconsistency penalty which measures consistency between the instance classification labels for the plurality of shots, the instance classification labels for the key-frames and the instance classification labels for the sets of filtered key-regions.
 18. The process of claim 17, wherein, given a set of structured metadata comprising a set of training shots $x_{i}$ and related instance classification labels $y_{i}$ to the target concepts for $x_{i}$, wherein $x_{i}$ and $y_{i}$ are given by the equation $\{(x_{i},y_{i})\}_{i=1}^{N^{1}}$, and $N^{1}$ is the total number of training shots $x_{i}$, and the set of training shots $x_{i}$ comprises a set $I$ of sequentially indexed nodes given by the equation $\{x_{1}, \ldots, x_{N^{1}}, x^{2}_{11}, \ldots, x^{2}_{N^{1}N^{2}_{N^{1}}}, x^{3}_{11}, \ldots, x^{3}_{N^{1}N^{3}_{N^{1}}}\}$, wherein $x^{l}_{im}$ is an $m^{th}$ sub-structure in the $l^{th}$ layer for shot $x_{i}$, and $N^{l}_{i}$ is the number of sub-structures in the $l^{th}$ layer for shot $x_{i}$, the modified learned objective decision function $f(\,)$ for the particular shot $x$ is given by the equation $f(x) = k_{MLMI}(x_{i},x)A\alpha + b$, wherein $\alpha$ is a vector of prescribed Lagrange multiplier coefficients given by the equation $\alpha = [\alpha^{1}_{1}, \ldots, \alpha^{1}_{N^{1}}, \alpha^{2}_{1}, \ldots, \alpha^{2}_{N^{1}}, \alpha^{3}_{1}, \ldots, \alpha^{3}_{N^{1}}]^{T}$, $k_{MLMI}(x_{i},x)$ is the MLMI kernel which compares the training shots $x_{i}$ to $x$, $A$ is a multi-instance transform matrix given by the equation $A_{IJ} = \begin{cases} y_{I} & I, J \in [1,N^{1}] \\ -1 & I \in [1,N^{1}],\ J \in [(3l-4)N^{1}+1,\ (3l-3)N^{1}];\ 2 \leq l \leq 3 \\ 1 & I \in [1,N^{1}],\ J \in [(3l-3)N^{1}+1,\ (3l-2)N^{1}];\ 2 \leq l \leq 3 \\ y_{J \,\%\, N^{1}} \times \beta_{I} & I \notin [1,N^{1}],\ J \in [(3l-5)N^{1}+1,\ (3l-4)N^{1}];\ l \text{ is the layer of node set } (I) \\ \beta_{I} & I \notin [1,N^{1}],\ J \in [(3l-4)N^{1}+1,\ (3l-3)N^{1}];\ l \text{ is the layer of node set } (I) \\ -\beta_{I} & I \notin [1,N^{1}],\ J \in [(3l-3)N^{1}+1,\ (3l-2)N^{1}];\ l \text{ is the layer of node set } (I) \\ 0 & \text{otherwise} \end{cases}$, $\beta_{I}$ is a prescribed coefficient for node set $(I)$ on layer $l$, and $b$ is a prescribed bias coefficient.
 19. The process of claim 18, wherein, constraint A is a Hinge loss function which measures disagreement between the instance classification labels $y_{i}$ and classification of the shots $x_{i}$ in the shot layer, constraint B is a Hinge loss function which measures disagreement between the instance classification labels $y_{i}$ and classification of the shots $x_{i}$ in the key-frame sub-layer and the key-region sub-layer, and constraint C is an inter-layer inconsistency loss function which measures disagreement between classification of the shots $x_{i}$ in the shot layer, the key-frame sub-layer and the key-region sub-layer, wherein, an L1 loss function is employed for constraints A, B and C.
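The two kinds of loss named in claim 19 can be illustrated with a short sketch. The hinge loss below is the standard form with a unit margin. The inter-layer inconsistency term is only one possible L1 form: aggregating the sub-layer scores with `max` is an assumption made for illustration and is not recited in the claims. Both function names are hypothetical.

```python
from typing import Sequence


def hinge_loss(y: int, score: float) -> float:
    # Standard hinge loss: zero when the label y in {-1, +1} agrees with
    # the score by a margin of at least one, linear penalty otherwise.
    return max(0.0, 1.0 - y * score)


def inter_layer_inconsistency(shot_score: float,
                              sub_scores: Sequence[float]) -> float:
    # Assumed L1 inconsistency: absolute gap between the shot-level score
    # and the largest score among its key-frames (or key-regions).
    return abs(shot_score - max(sub_scores))
```

A correctly classified shot with a comfortable margin, such as `hinge_loss(1, 2.0)`, incurs no penalty, while `hinge_loss(-1, 0.5)` is penalized by 1.5.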
 20. A computer-implemented process for classifying visual concepts contained within a video clip based upon a prescribed set of target concepts, comprising process actions of: segmenting the clip into a plurality of shots, wherein each shot comprises a series of consecutive frames that represent a distinctive coherent visual theme; constructing a multi-layer multi-instance (MLMI) structured metadata representation of each shot, comprising, a layer indicator $l$, a hierarchy of three layers, said hierarchy comprising, an uppermost shot layer, $l=1$, comprising the plurality of shots, an intermediate key-frame sub-layer, $l=2$, contiguously beneath the shot layer, comprising one or more key-frames for each shot, wherein each key-frame comprises one or more of the target concepts, and a lowermost key-region sub-layer, $l=3$, contiguously beneath the key-frame sub-layer, comprising a set of filtered key-regions for each key-frame, wherein each filtered key-region comprises a particular target concept, and a rooted tree structure, comprising a connected acyclic directed graph of nodes, wherein, each node $n$ is connected via a unique path to a root node, each node $n$ comprises structured metadata of a certain granularity describing a particular visual concept, the granularity of the metadata increases for each successive layer down the hierarchy, each layer comprises a node pattern group $G_{l}$ comprising all the nodes in the layer, and a plurality of low-level feature descriptors $f_{n}$ are prescribed for each node $n$ within each layer which describe the visual concepts contained within the layer; validating a set of pre-generated trained models of the target concepts using a set of training shots selected from the plurality of shots; recursively generating an MLMI kernel $k_{MLMI}(\,)$ which models the MLMI structured metadata representation of each shot by comparing prescribed pairs of shots, said recursive generation comprising actions of, (a) inputting two particular shots $T$ and $T'$, (b) initializing $k_{MLMI}(T,T')$ to zero, wherein $k_{MLMI}(T,T')$ compares $T$ and $T'$ to determine a degree of similarity there-between, (c) initializing the layer indicator $l$ to three, (d) whenever $l$ is greater than zero, for each node $n$ in the node pattern group $G_{l}$ in $T$, and for each node $n'$ in the node pattern group $G_{l}'$ in $T'$, whenever $n$ or $n'$ are on a lowermost leaf layer of $G_{l}$ or $G_{l}'$ respectively, computing a kernel $k_{\hat{N}}(\hat{n},\hat{n}')$ comparing a node pattern $\hat{n}$ of $n$ and a node pattern $\hat{n}'$ of $n'$ as $k_{\hat{N}}(\hat{n},\hat{n}') = k_{f}(f_{n},f_{n}')$, wherein $f_{n}$ and $f_{n}'$ are the feature descriptors prescribed for $n$ and $n'$ respectively, $k_{f}(f_{n},f_{n}')$ is a feature-space kernel given by the equation $k_{f}(f_{n},f_{n}') = \exp\left(-|f_{n}-f_{n}'|^{2}/2\sigma^{2}\right)$, wherein $\sigma$ is a prescribed coefficient, and the node patterns $\hat{n}$ and $\hat{n}'$ comprise all the metadata associated with $n$ and $n'$ respectively, and whenever $n$ or $n'$ are not on the lowermost leaf layer of $G_{l}$ or $G_{l}'$ respectively, computing $k_{\hat{N}}(\hat{n},\hat{n}')$ as $k_{\hat{N}}(\hat{n},\hat{n}') = k_{f}(f_{n},f_{n}') \times \sum_{\hat{c} \in s_{n},\, \hat{c}' \in s_{n'}'} k_{\hat{N}}(\hat{c},\hat{c}')$, wherein $s_{n}$ and $s_{n'}'$ are sets of node patterns whose parent nodes are $n$ and $n'$ respectively, (e) updating $k_{MLMI}(T,T')$ by computing $k_{MLMI}(T,T')_{UD} = k_{MLMI}(T,T') + k_{\hat{N}}(\hat{n},\hat{n}')$ and then computing $k_{MLMI}(T,T') = k_{MLMI}(T,T')_{UD}$, (f) decrementing $l$ by one, (g) repeating actions (d)-(f) until $l$ equals zero; and utilizing a regularization framework in conjunction with $k_{MLMI}(\,)$ to generate a modified learned objective decision function which learns a classifier for determining if a particular shot, that is not in the set of training shots, comprises instances of the target concepts, wherein the regularization framework introduces explicit constraints which serve to maximize the precision of the classifier, said constraints comprising, a constraint A comprising a ground truth for the target concepts and instance classification labels for the plurality of shots in the shot layer, said constraint A serving to minimize instance classification errors for said shots, and a constraint C comprising the instance classification labels for the plurality of shots in the shot layer, instance classification labels for the key-frames in the key-frame sub-layer, and instance classification labels for the sets of filtered key-regions in the key-region sub-layer, said constraint C serving to minimize an inter-layer inconsistency penalty which measures consistency between the instance classification labels for the plurality of shots, the instance classification labels for the key-frames and the instance classification labels for the sets of filtered key-regions.